AAAI 2026

January 23, 2026

Singapore, Singapore


Reward-model-based fine-tuning is a central paradigm for aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect the intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts: cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-tau distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and the policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
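The abstract does not specify how the global Kendall-tau conflict measure is computed; the sketch below is one plausible, illustrative reading, assuming that for each prompt we rank a set of candidate responses once by the policy's scores (e.g. log-probabilities) and once by the proxy reward model, then report the normalized number of discordant pairs. The function names (`conflict_score`) and the example scores are assumptions for illustration, not the authors' implementation.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Count discordant pairs between two rankings.

    rank_a and rank_b map each item index to its rank position.
    """
    assert len(rank_a) == len(rank_b)
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is discordant if the two rankings order it oppositely.
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
            discordant += 1
    return discordant

def conflict_score(policy_scores, proxy_rewards):
    """Normalized Kendall-tau distance in [0, 1] between the ranking of
    candidate responses induced by the policy and the ranking induced by
    the proxy reward model (hypothetical conflict measure)."""
    n = len(policy_scores)
    order_policy = sorted(range(n), key=lambda k: -policy_scores[k])
    order_proxy = sorted(range(n), key=lambda k: -proxy_rewards[k])
    # Convert orderings into per-item rank positions.
    pos_policy = [0] * n
    pos_proxy = [0] * n
    for r, k in enumerate(order_policy):
        pos_policy[k] = r
    for r, k in enumerate(order_proxy):
        pos_proxy[k] = r
    max_pairs = n * (n - 1) // 2
    return kendall_tau_distance(pos_policy, pos_proxy) / max_pairs

# Example: four candidate answers for one prompt (illustrative scores).
policy = [2.0, 1.5, 0.3, -1.0]   # policy log-probabilities
proxy = [0.1, 0.9, 0.8, 0.2]     # proxy reward model scores
print(conflict_score(policy, proxy))  # 0.5: half the pairs disagree
```

Under this reading, prompts whose candidate sets score near 1.0 would be the high-conflict cases that SHF-CAS routes to human annotators, while scores near 0.0 indicate that policy and proxy already agree.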
