AAAI 2026

January 23, 2026

Singapore, Singapore


Reward-model-based fine-tuning is a central paradigm for aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect the intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts: cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-tau distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and the policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
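The abstract does not specify how the global Kendall-tau conflict measure is computed; the sketch below is one plausible, illustrative reading, assuming that for each prompt we rank a set of candidate responses once by the policy's scores (e.g. log-probabilities) and once by the proxy reward model, then report the normalized number of discordant pairs. The function names (`conflict_score`) and the example scores are assumptions for illustration, not the authors' implementation.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Count discordant pairs between two rankings.

    rank_a and rank_b map each item index to its rank position.
    """
    assert len(rank_a) == len(rank_b)
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is discordant if the two rankings order it oppositely.
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
            discordant += 1
    return discordant

def conflict_score(policy_scores, proxy_rewards):
    """Normalized Kendall-tau distance in [0, 1] between the ranking of
    candidate responses induced by the policy and the ranking induced by
    the proxy reward model (hypothetical conflict measure)."""
    n = len(policy_scores)
    order_policy = sorted(range(n), key=lambda k: -policy_scores[k])
    order_proxy = sorted(range(n), key=lambda k: -proxy_rewards[k])
    # Convert orderings into per-item rank positions.
    pos_policy = [0] * n
    pos_proxy = [0] * n
    for r, k in enumerate(order_policy):
        pos_policy[k] = r
    for r, k in enumerate(order_proxy):
        pos_proxy[k] = r
    max_pairs = n * (n - 1) // 2
    return kendall_tau_distance(pos_policy, pos_proxy) / max_pairs

# Example: four candidate answers for one prompt (illustrative scores).
policy = [2.0, 1.5, 0.3, -1.0]   # policy log-probabilities
proxy = [0.1, 0.9, 0.8, 0.2]     # proxy reward model scores
print(conflict_score(policy, proxy))  # 0.5: half the pairs disagree
```

Under this reading, prompts whose candidate sets score near 1.0 would be the high-conflict cases that SHF-CAS routes to human annotators, while scores near 0.0 indicate that policy and proxy already agree.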
