Singapore

Quality of datasets plays an important role in large language model (LLM) alignment.
In collecting human feedback, however, preference flipping is ubiquitous and corrupts the annotated data;
this calls for alignment algorithms that are more robust to potentially flipped pairs.
To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective.
We separate the inherent human intention model from the preference flipping mechanism introduced by external factors, treating them as two distinct stages;
in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model.
Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns.
In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms.
In our experiments, we evaluate the proposed method and baseline methods under the instance-dependent preference flipping model in multiple settings.
The model implementation details and the code, as well as a complete manuscript with colored hyperlinks and technical appendix for better digital viewing, are included as supplementary materials and scheduled to be open-sourced upon publication.
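The two-stage view described above (a BT preference followed by an instance-dependent flip) can be illustrated with a minimal sketch. This is not the authors' FA-DPO implementation: it assumes the standard label-noise mixture, in which the observed preference probability is `(1 - eps) * p + eps * (1 - p)` with `p` the BT probability and `eps` a per-example flipping probability. The function names, the `beta` default, and the way `eps` is supplied are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dpo_margin(logp_w: float, logp_l: float,
               ref_logp_w: float, ref_logp_l: float,
               beta: float = 0.1) -> float:
    """DPO implicit reward margin between the annotated winner and loser.

    beta is a hypothetical default, not a value taken from the paper.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def flip_aware_nll(margin: float, flip_prob: float) -> float:
    """Negative log-likelihood of the observed label under a flipped BT model.

    Stage 1 (intention): p_clean = sigmoid(margin) is the BT probability
    that the annotated winner is truly preferred.
    Stage 2 (flipping): the label is flipped with probability flip_prob,
    assumed here to be given per example (e.g. by a separate predictor).
    """
    p_clean = sigmoid(margin)
    p_observed = (1.0 - flip_prob) * p_clean + flip_prob * (1.0 - p_clean)
    return -math.log(max(p_observed, 1e-12))


if __name__ == "__main__":
    m = dpo_margin(-1.0, -2.0, -1.5, -1.5)
    print("margin:", m)
    print("loss, no flipping:", flip_aware_nll(m, 0.0))
    print("loss, eps = 0.2:  ", flip_aware_nll(m, 0.2))
```

With `flip_prob = 0` the loss reduces to the usual DPO negative log-sigmoid term. As `flip_prob` grows, pairs whose margin strongly contradicts the annotation are penalized less, which is the intuition behind a flipping-aware loss under this assumed noise model.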

AAAI 2026

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

safety and robustness

(large) language models

rlhf

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Safe Multi-Agent Reinforcement Learning (MARL) typically requires specifying numerical cost functions to ensure policy behaviors adhere to safety constraints. As systems scale and human-defined constraints become diverse, context-dependent, and frequently updated, manual crafting of these numerical cost definitions becomes prohibitively complex, tedious, and error-prone. Natural language presents an intuitive yet powerful alternative for defining constraints, enabling broader accessibility and easier adaptability to new scenarios and evolving rules. However, current MARL frameworks lack effective mechanisms to incorporate free-form textual constraints intelligently and robustly. To bridge this gap, we introduce Safe Multi-Agent Reinforcement Learning with natural Language constraints (SMALL), a novel approach leveraging fine-tuned language models to parse and encode textual constraints into semantically meaningful embeddings. These embeddings reflect prohibited states or behaviors, thus allowing automated and accurate prediction of constraint violations. We integrate these learned embeddings directly into MARL frameworks, enabling agents to optimize task performance while simultaneously minimizing constraint violations, all without relying upon explicitly defined numeric penalties. To rigorously evaluate our method, we also propose the LaMaSafe benchmark, a set of diverse multi-agent tasks uniquely designed to assess the capability of MARL algorithms in understanding and adhering to realistic, human-provided natural language constraints. Experimental results across various LaMaSafe environments demonstrate that SMALL achieves comparable task performance to state-of-the-art baselines while significantly reducing constraint violations.

Safe Multi-agent Reinforcement Learning with Natural Language Constraints

Humans display significant uncertainty when faced with moral dilemmas, yet the extent of such uncertainty in large language models (LLMs) remains underexplored. In contrast, studies have confirmed the tendency of LLMs to be overly confident in their judgments, even as they are embedded in ethical decision-making frameworks, necessitating a deeper understanding of their moral reasoning and inherent uncertainties for building reliable AI systems. This work examines how uncertainties affect moral decisions in trolley problems across 32 open-source LLMs, spanning 9 distinct moral dimensions. Our analysis reveals that the variance in LLM confidence is greater among different models than it is within moral dimensions, indicating that moral uncertainty is predominantly shaped by the LLM architecture and training methodology. Next, we measure uncertainty via binary entropy and decompose it into total entropy, conditional entropy, and mutual information. To explore the effect of uncertainty in models, we deliberately add stochasticity to the models via “dropout” at inference time. Our findings indicate that this intervention leads to a higher total entropy, primarily through an increase in mutual information, while conditional entropy remains largely unchanged. This intervention further yields significant improvements in human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs’ confidence in morally complex scenarios.
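The entropy decomposition mentioned in this abstract can be sketched as follows. The abstract does not give the estimator, so this assumes the common formulation over several stochastic forward passes (for example, with dropout active): total entropy is the binary entropy of the mean predicted probability, conditional entropy is the mean of the per-pass entropies, and mutual information is their difference. The function names and the input format (a list of per-pass probabilities for one binary choice) are hypothetical.

```python
import math
from typing import List, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def decompose_uncertainty(probs: List[float]) -> Tuple[float, float, float]:
    """Return (total entropy, conditional entropy, mutual information).

    probs holds the probability of one option from each stochastic pass.
    """
    if not probs:
        raise ValueError("need at least one forward pass")
    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)                                 # H[E[p]]
    conditional = sum(binary_entropy(p) for p in probs) / len(probs)  # E[H[p]]
    mutual_info = total - conditional                              # disagreement across passes
    return total, conditional, mutual_info


if __name__ == "__main__":
    # Passes that agree: uncertainty is all conditional, none is mutual information.
    print(decompose_uncertainty([0.5, 0.5, 0.5]))
    # Confident passes that disagree: uncertainty shows up as mutual information.
    print(decompose_uncertainty([0.0, 1.0]))
```

Under this formulation, an increase in mutual information with unchanged conditional entropy means individual passes stay about as confident as before while disagreeing more with one another.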

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

Large Language Model (LLM) unlearning, i.e., selectively removing information from LLMs, is vital for responsible model deployment. In contrast, LLM knowledge editing aims to modify LLM knowledge instead of removing it. Though editing and unlearning seem to be two distinct tasks, we find there is a tight connection between them. In this paper, we conceptualize unlearning as a special case of editing where information is modified to a refusal or "empty set" response, signifying its removal. This paper thus investigates whether knowledge editing techniques are strong baselines for LLM unlearning. We evaluate state-of-the-art (SOTA) editing methods (e.g., ROME, MEMIT, GRACE, WISE, and AlphaEdit) against existing unlearning approaches on pretrained and finetuned knowledge. Results show certain editing methods, notably WISE and AlphaEdit, are effective unlearning baselines, especially for pretrained knowledge, and excel in generating human-aligned refusal answers. To better adapt editing methods for unlearning applications, we propose practical recipes including self-improvement and query merging. The former leverages the LLM's own in-context learning ability to craft a more human-aligned unlearning target, and the latter enables ROME and MEMIT to perform well in unlearning longer sample sequences. We advocate for the unlearning community to adopt SOTA editing methods as baselines and explore unlearning from an editing perspective for more holistic LLM memory control.

Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?

Adversarial Security of Financial Language Models (ASFLM) is critical as Large Language Models (LLMs) pervade high-stakes financial applications. However, LLMs face two key challenges: their vulnerability to damaging adversarial attacks and the prevalent research gap concerning robust defenses against sophisticated, semantically coherent threats. To address these, we first theoretically analyze the relationship between discrete and continuous adversarial optimization, proving that the continuous optimum provides a lower bound for the discrete one. This foundation supports our novel two-stage framework, ChameleonAttack. It employs Adaptive Latent-space Optimization (ALO) for potent adversarial token discovery, followed by a Semantic-Translation Module (STM) to generate fluent, coherent, and natural-sounding adversarial text. This dual approach aims to maximize attack impact while ensuring high linguistic quality and semantic integrity for evasion. Evaluated on state-of-the-art financial LLMs (e.g., FinBERT) and standard benchmarks (e.g., Financial PhraseBank), ChameleonAttack achieves a high Attack Success Rate (ASR) of 93.4%. These results highlight significant practical vulnerabilities and underscore the urgent need for robust defense mechanisms in the financial domain.

Semantics-Preserving Adversarial Attacks on Event-Driven Stock Prediction Models

Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.

Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning

Although large language models (LLMs) are increasingly trained using human feedback for safety and alignment with human values, alignment decisions systematically overlook human social diversity. Researchers have advocated for sociotechnical approaches integrating pluralistic values into model training. This study examines how operationalizing these approaches shapes LLM behavior through a systematic evaluation of demographic variation and design parameters within the alignment pipeline. We built an inclusive data collection process, obtaining alignment datasets from US and German participants (N = 1,095 participants, 27,375 ratings) who rated LLM responses across five dimensions: toxicity, emotional awareness, sensitivity, stereotypical bias, and helpfulness. We fine-tuned multiple Large Language Models and Large Reasoning Models using value preferences from different social groups while varying key parameters, including rating scale, disagreement handling, and preference optimization method. Results show systematic demographic effects: male participants rated responses as 18% less toxic than female participants; conservative and Black participants reported 27.9% and 58% higher emotional awareness ratings than liberal and White participants, respectively. Models finetuned on group-specific preferences exhibited distinct behaviors. Technical decisions had even greater impact on toxicity reduction: preserving rater disagreement was 1.6 times more effective than majority vote aggregation; 5-point scales outperformed binary formats by 1.23 times; Direct Preference Optimization (DPO) consistently outperformed Group-based Relative Preference Optimization (GRPO) on both toxicity and emotional awareness optimization. Meta-analysis confirms the robustness of these effects, raising a critical question: How should alignment processes balance expert-driven and user-driven signals to ensure both safety and fair representation?

Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Present-day LLMs face the challenge of managing affordance-based safety risks: situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence clearly demonstrates that this approach surpasses existing methods, significantly improving the logical integrity and affordance-sensitive safety of model outputs. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.

AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models

Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations. At its core, LieCraft is a novel multiplayer hidden-role game in which players select an ethical alignment and execute strategies over a long time-horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, while Defectors evade suspicion while secretly sabotaging missions. To enable real-world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high-stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state-of-the-art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.

LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

3D generation and reconstruction techniques have been widely used in computer games, film, and other content creation areas. As these applications grow, there is an increasing demand for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. However, in many practical cases, a shape's realism does not depend on having a ground truth reference. In this work, we propose a Shape-Realism Alignment Metric that leverages a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder is designed to align the language model’s output with human perception of realism. Additionally, we introduce a new dataset, RealismGrading, which provides human-annotated realism scores without the need for ground truth shapes. Our dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric's performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perceptions, outperforms existing methods, and generalizes well.

SRAM: Shape-Realism Alignment Metric for No Reference 3D Shape Evaluation

Large Audio Language Models (LALMs) are transforming AI by directly processing and generating human language from audio. As these models proliferate in real-world applications, evaluating their performance for equitable and safe use across diverse linguistic and cultural contexts becomes paramount. This paper presents the first comprehensive study on cultural preferences and biases in LALMs across multilingual and multicultural settings. We extend existing cultural harm frameworks from text-based models to the audio domain, analysing how linguistic and cultural diversity influence LALM behaviour, sensitivity, and output quality. Our research uncovers unique challenges in interpreting cultural nuances from audio and linguistic variations. We introduce a novel multilingual audio-text dataset (10 languages, including English), the Audio Cultural Intelligence Dataset (ACID Benchmark) spanning 1315 hours in audio length, specifically for evaluating LALM cultural biases, marking the first such examination in this emerging area. Our comprehensive analysis includes 10 open-source and 2 closed-source models, demonstrating significant performance disparities across languages and cultural contexts, highlighting the audio modality's influence on bias manifestation. These findings highlight the critical need to evaluate LALMs not only for technical accuracy but also for fair and culturally sensitive performance, urging the development of inclusive datasets and cultural awareness for building safer and more equitable large audio language models. The ACID benchmark will be made publicly available.
