Large language models (LLMs) are widely adopted across diverse AI applications. To align LLM behavior with human values, Reinforcement Learning from Human Feedback (RLHF) employs a reward model (RM) as a proxy for human preferences to guide policy optimization. Consequently, the accuracy, reliability, and interpretability of the RM critically influence downstream alignment outcomes. However, conventional scalar RMs are both opaque and rigid, offering little insight into reward reasoning and lacking adaptability to evolving preferences. While recent work on multidimensional RMs has sought to improve interpretability, these methods often fall short in feature-level attribution and incur substantial annotation costs. To address these challenges, we propose the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into the reward modeling pipeline. Specifically, SARM projects LLM hidden activations into a sparse monosemantic feature space, with a scalar head aggregating these features to produce reward scores attributable to interpretable concepts. Experiments demonstrate that SARM enables direct attribution of reward scores to interpretable feature activations, supports dynamic preference adjustment, and outperforms standard scalar RMs in alignment tasks.
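The architecture described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dimensions, the ReLU-based SAE encoder, and all parameter values are assumptions chosen only to show how a sparse feature projection plus a scalar head yields a reward that decomposes into per-feature contributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: LLM hidden dimension and SAE feature count (assumptions).
d_model, n_features = 16, 64

# Stand-ins for a pretrained SAE encoder; in SARM these would be learned, not random.
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

# Scalar reward head aggregating the sparse features.
w_reward = rng.normal(size=n_features)

def sae_encode(h):
    """Project a hidden activation into the SAE's sparse feature space."""
    # ReLU keeps feature activations non-negative and sparse.
    return np.maximum(W_enc @ h + b_enc, 0.0)

def reward(h):
    """Aggregate sparse features into a scalar reward score."""
    return float(w_reward @ sae_encode(h))

def attribute(h, top_k=3):
    """Per-feature contributions to the reward; they sum to the reward itself."""
    contrib = w_reward * sae_encode(h)
    top = np.argsort(-np.abs(contrib))[:top_k]
    return [(int(i), float(contrib[i])) for i in top]

h = rng.normal(size=d_model)   # stand-in for an LLM hidden activation
r = reward(h)
top_features = attribute(h)    # the features most responsible for this score
```

Because the reward is a linear readout of the sparse code, each score decomposes exactly into per-feature contributions, which is what makes the reward attributable to interpretable concepts; dynamic preference adjustment then amounts to re-weighting entries of the reward head.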
