AAAI 2026

January 22, 2026

Singapore, Singapore

Large language models (LLMs) are widely adopted across diverse AI applications. To align LLM behavior with human values, Reinforcement Learning from Human Feedback (RLHF) employs a reward model (RM) as a proxy for human preferences to guide policy optimization. Consequently, the accuracy, reliability, and interpretability of the RM critically influence downstream alignment outcomes. However, conventional scalar RMs are both opaque and rigid, offering little insight into reward reasoning and lacking adaptability to evolving preferences. While recent work on multidimensional RMs has sought to improve interpretability, these methods often fall short in feature-level attribution and incur substantial annotation costs. To address these challenges, we propose the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into the reward modeling pipeline. Specifically, SARM projects LLM hidden activations into a sparse monosemantic feature space, with a scalar head aggregating these features to produce reward scores attributable to interpretable concepts. Experiments demonstrate that SARM enables direct attribution of reward scores to interpretable feature activations, supports dynamic preference adjustment, and outperforms standard scalar RMs in alignment tasks.
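
The pipeline the abstract describes (hidden activation, then sparse features, then a scalar head) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the SAE variant, the form of the head, or how preferences are adjusted. The sketch assumes a ReLU encoder with top-k sparsification, a linear head without bias, and adjustment by editing a single head weight. The names `sae_encode`, `score`, `w_enc`, `b_enc`, `head_w`, and all numeric values are hypothetical.

```python
def sae_encode(hidden, w_enc, b_enc, k):
    """Map a hidden vector to sparse features: ReLU, then keep the k largest.

    Assumption: a top-k SAE encoder; the abstract does not name the variant.
    """
    pre = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(w_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    keep = set(top)
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def score(features, head_w):
    """Return the scalar reward and the contribution of each active feature.

    Assumption: a linear head, so reward = sum_i head_w[i] * features[i]
    and each term is directly attributable to one feature.
    """
    contrib = {i: w * z
               for i, (w, z) in enumerate(zip(head_w, features)) if z > 0.0}
    return sum(contrib.values()), contrib


# Toy values, all hypothetical: a 3-dim hidden state and 4 SAE features.
hidden = [1.0, -2.0, 0.5]
w_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0, 0.75]
head_w = [2.0, -1.0, -1.0, 3.0]

features = sae_encode(hidden, w_enc, b_enc, k=2)
reward, contrib = score(features, head_w)

# Hypothetical preference adjustment: stop penalizing feature 2 by zeroing
# its head weight, without retraining the encoder.
adjusted_w = list(head_w)
adjusted_w[2] = 0.0
adjusted_reward, _ = score(features, adjusted_w)

print(features, reward, contrib, adjusted_reward)
```

Because the head is assumed linear, the per-feature products in `contrib` sum exactly to the reward, which is what makes the score attributable to individual features in this sketch.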
