Singapore

Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias—a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.

AAAI 2026

Mitigating Length Bias in RLHF Through a Causal Lens

nlp: (large) language models

ml: causal learning

hai: learning human values and preferences

poster
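The two pair types named in the abstract above can be illustrated with a small diagnostic. This is a minimal sketch, not the authors' method: the toy reward functions, the example strings, and the helper names are all hypothetical, and the Bradley-Terry pairwise loss is assumed as the reward-model objective because the abstract does not specify one. Length-divergent pairs are used here to measure how much reward changes with length alone, and content-divergent pairs to measure how well the reward separates good from bad content at equal length.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_length_gap(reward_fn, pairs):
    """Average reward(long) - reward(short) over (short, long) pairs
    whose content is roughly the same. Near zero means little length bias."""
    gaps = [reward_fn(long_text) - reward_fn(short_text)
            for short_text, long_text in pairs]
    return sum(gaps) / len(gaps)


def mean_pairwise_loss(reward_fn, pairs):
    """Average Bradley-Terry loss over (chosen, rejected) pairs."""
    losses = [bradley_terry_loss(reward_fn(chosen), reward_fn(rejected))
              for chosen, rejected in pairs]
    return sum(losses) / len(losses)


# Hypothetical stand-ins for trained reward models.
def toy_length_reward(text):
    # Rewards verbosity only: half a point per whitespace token.
    return 0.5 * len(text.split())


def toy_content_reward(text):
    # Rewards content only: checks for the correct answer.
    return 1.0 if "Paris" in text else 0.0


# (short, long) pairs with similar content.
length_divergent = [
    ("Paris.",
     "The capital of France is, as is widely known, the city of Paris."),
]
# (chosen, rejected) pairs of similar length.
content_divergent = [
    ("The capital of France is Paris.", "The capital of France is Lyon."),
]

for name, fn in [("length-only", toy_length_reward),
                 ("content-only", toy_content_reward)]:
    gap = mean_length_gap(fn, length_divergent)
    loss = mean_pairwise_loss(fn, content_divergent)
    print(f"{name}: length gap = {gap:.2f}, content loss = {loss:.3f}")
```

In this toy setting the length-only reward shows a large gap on the length-divergent pair and cannot separate the content-divergent pair, while the content-only reward shows the opposite pattern.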

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


High-quality multi-hop instruction data is critical for enhancing the reasoning capabilities of large language models (LLMs) in complex long-context scenarios, e.g., long-form reasoning. Nevertheless, such datasets are currently scarce, and existing data synthesis approaches typically do not model intermediate reasoning steps explicitly, resulting in unverifiable and potentially erroneous samples. To mitigate this issue, we design the **C**oncept-**G**raph based **M**ulti-hop **I**nstruction **S**ynthesis (CGMIS) framework, which constructs long-form reasoning paths via concept graph traversal and automatically generates verifiable multi-hop data. The CGMIS framework not only guarantees the accuracy and verifiability of the synthesized data but also enables the construction of high-quality multi-hop instruction datasets from arbitrary corpora. Experiments show that fine-tuning with CGMIS-generated data achieves state-of-the-art performance across 13 long-context reasoning tasks on various models, using only 10% of the data volume required by existing methods.

CGMIS: Concept-Graph Based Multi-Hop Instructions Synthesis for Enhancing Long-Context Reasoning
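The idea of building reasoning paths by traversing a concept graph can be sketched as follows. This is a hedged illustration only: the graph format, the toy graph, and the function names `k_hop_paths` and `verify_path` are assumptions, since the abstract does not describe the CGMIS data structures or traversal strategy. The point it shows is that a path of explicit hops can be checked edge by edge, which is what makes the resulting sample verifiable.

```python
def k_hop_paths(graph, start, k):
    """Enumerate simple paths with exactly k edges starting at `start`.

    `graph` maps a concept to a list of (relation, neighbor) tuples.
    Each path is a list of (head, relation, tail) hops.
    """
    paths = []

    def walk(node, path, visited):
        if len(path) == k:
            paths.append(list(path))
            return
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:  # avoid cycles
                path.append((node, relation, neighbor))
                walk(neighbor, path, visited | {neighbor})
                path.pop()

    walk(start, [], {start})
    return paths


def verify_path(graph, path):
    """Check that every hop is an edge of the graph and that hops chain."""
    for i, (head, relation, tail) in enumerate(path):
        if (relation, tail) not in graph.get(head, []):
            return False
        if i > 0 and path[i - 1][2] != head:
            return False
    return True


# Hypothetical toy concept graph.
toy_graph = {
    "Marie Curie": [("born in", "Warsaw"), ("field", "physics")],
    "Warsaw": [("capital of", "Poland")],
    "Poland": [("currency", "zloty")],
}

two_hop = k_hop_paths(toy_graph, "Marie Curie", 2)
for path in two_hop:
    sample = {
        "steps": path,           # intermediate reasoning steps
        "answer": path[-1][2],   # final concept reached
        "verified": verify_path(toy_graph, path),
    }
    print(sample)
```

A real pipeline would still need to turn each path into a natural-language instruction, which this sketch leaves out.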

This paper tackles the fundamental failure of Large Language Models (LLMs) to solve new tasks when prompted with a sufficient, yet overly complex, set of multi-modal episodes. This failure stems from the model's inability to distill underlying patterns from noisy experiences. We propose Hypothesis-Driven Reasoning (HDR), a framework that enhances LLM reasoning by building an explicit semantic memory: a set of hypotheses induced from the multi-modal episodes. HDR employs a two-stage pipeline. It first extracts potential factors from the episodes and then iteratively refines hypotheses through a generate-verify loop over those factors. We first empirically demonstrate this failure and the potential of semantic memory, showing that oracle hypotheses can boost accuracy from 35.3% to 92.0% on a novel task we designed. We then evaluate HDR, which achieves near-oracle performance and significantly outperforms baselines, especially on smaller models. This paper validates a shift from unstructured in-context recall to explicit knowledge abstraction for robust reasoning.

Hypothesis-Driven Reasoning for Large Language Models

In modern software development workflows, the open-source software supply chain significantly contributes to efficient and convenient engineering practices. With increasing system complexity, it has become common practice to use open-source software as third-party dependencies. However, due to the lack of maintenance for underlying dependencies and insufficient community auditing, ensuring the security of source code and the legitimacy of repository maintainers has become a challenge, particularly in the context of highly stealthy backdoor attacks such as the XZ Utils incident. To address these problems, we propose a fine-grained project evaluation framework for backdoor risk assessment in open-source software. Our evaluation framework models highly stealthy backdoor attacks from the attacker's perspective and defines targeted metrics for each attack stage. Moreover, to overcome the limitations of static analysis in assessing the reliability of repository maintenance activities (such as irregular committer privilege escalation and insufficient review participation), we employ large language models (LLMs) to perform semantic evaluation of code repositories while avoiding reliance on manually crafted patterns. The effectiveness of our framework is validated on 156 high-priority Debian packages, and the experimental results reveal that the current open-source software supply chain is exposed to a series of security risks.

An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, and usually demand an understanding of the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
The associated code and dataset will be released publicly.

DeepPhy: Benchmarking Agentic VLMs on Physical Reasoning

Partial Domain Adaptation (PDA) aims to generalize a classification model learned on a labeled source domain to an unlabeled target domain, where the target label space is a subset of the source label space. In PDA tasks, existing methods typically achieve transferability through distribution alignment in a statistical framework, and discriminability through geometric modeling. These two aspects are often treated as separate frameworks, which severs the intrinsic connection between them.
To bridge this gap, we propose a unified framework termed Geometry-aware Conditional Alignment (GCA), which is derived from theoretical insights of MCRR. GCA collaboratively achieves conditional alignment and orthogonal discriminability in a unified framework, making the learned features more interpretable in both statistical and geometric aspects.
Extensive experiments on four benchmark datasets are conducted to demonstrate the effectiveness of GCA.

GCA: Geometry-aware Conditional Alignment for Partial Domain Adaptation with Coding Rate Reduction
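For readers unfamiliar with the coding rate reduction named in the title, the standard formulation from the maximal coding rate reduction literature is reproduced below. Whether GCA uses exactly this form is an assumption, since the abstract does not state its objective. Here $Z \in \mathbb{R}^{d \times n}$ holds $n$ feature vectors of dimension $d$, $\epsilon$ is the allowed distortion, and $\Pi_j$ is the diagonal membership matrix of class $j$ among $k$ classes.

```latex
R(Z, \epsilon) = \frac{1}{2} \log\det\!\left(I + \frac{d}{n\,\epsilon^{2}}\, Z Z^{\top}\right),
\qquad
R_c(Z, \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
  \log\det\!\left(I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\, Z \Pi_j Z^{\top}\right),
\qquad
\Delta R(Z, \Pi, \epsilon) = R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi)
```

Maximizing $\Delta R$ expands the coding rate of the whole feature set while compressing each class, which in that literature is associated with class features occupying near-orthogonal subspaces.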

Anomaly detection in dynamic graphs aims to capture the dynamic evolution characteristics of graphs and then identify abnormal behaviors that deviate from normal patterns. However, previous studies fail to decouple periodic and bursty information during the time encoding process, which hinders their performance. In addition, most existing methods use attention mechanisms to capture the importance of time points but fail to leverage the normal and abnormal characteristics in the frequency domain. To address these issues, we propose a model that integrates multi-scale Frequency encoding with Time-frequency Attention for Anomaly Detection in dynamic graphs, named FreqTAD. We design a multi-scale frequency encoder that decomposes time series into distinct periodic and bursty components. Moreover, we present an effective time-frequency attention mechanism that focuses on frequency components to differentiate frequency-domain features of normal and abnormal behaviors. Experimental results on four datasets demonstrate the superior performance of FreqTAD in both anomaly detection accuracy and computational efficiency.

FreqTAD: Multi-scale Frequency Encoding and Time-Frequency Attention for Anomaly Detection in Dynamic Graphs
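The notion of splitting a time series into periodic and bursty parts can be made concrete with a plain discrete Fourier transform. This is a simplified sketch under stated assumptions, not the FreqTAD encoder: the abstract does not describe the multi-scale design, so the example uses a single scale, keeps the `top_k` strongest frequency bins as the periodic part, and treats the residual as the bursty part. The function names and the test signal are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_periodic_bursty(x, top_k=1):
    """Periodic part = mean plus the top_k strongest frequency bins;
    bursty part = whatever is left over."""
    n = len(x)
    spectrum = dft(x)
    positive_bins = range(1, n // 2 + 1)
    strongest = sorted(positive_bins, key=lambda k: abs(spectrum[k]),
                       reverse=True)[:top_k]
    keep = {0}  # always keep the mean (DC) term
    for k in strongest:
        keep.add(k)
        keep.add((n - k) % n)  # mirrored bin keeps the result real
    periodic = idft_real([spectrum[k] if k in keep else 0 for k in range(n)])
    bursty = [value - p for value, p in zip(x, periodic)]
    return periodic, bursty


# Hypothetical signal: a sine wave with period 8 plus one burst at t = 10.
n = 32
signal = [math.sin(2 * math.pi * t / 8) for t in range(n)]
signal[10] += 5.0

periodic, bursty = split_periodic_bursty(signal, top_k=1)
burst_position = max(range(n), key=lambda t: bursty[t])
print("largest bursty residual at t =", burst_position)
```

Because the burst spreads a little energy into every frequency bin, the periodic part is not a perfect sine, but the residual still peaks at the burst.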

Multi-agent systems (MAS) powered by large language models (LLMs) hold significant promise for solving complex decision-making tasks. However, the core process of collaborative decision-making (CDM) within these systems remains underexplored. Existing approaches often rely on either "dictatorial" strategies that are vulnerable to the cognitive biases of a single agent, or "voting-based" methods that fail to fully harness collective intelligence. To address these limitations, we propose **AgentCDM**, a structured framework for enhancing collaborative decision-making in LLM-based multi-agent systems. Drawing inspiration from the Analysis of Competing Hypotheses (ACH) in cognitive science, AgentCDM introduces a structured reasoning paradigm that systematically mitigates cognitive biases and shifts decision-making from passive answer selection to active hypothesis evaluation and construction. To internalize this reasoning process, we develop a two-stage training paradigm: the first stage uses explicit ACH-inspired scaffolding to guide the model through structured reasoning, while the second stage progressively removes this scaffolding to encourage autonomous generalization. Experiments on multiple benchmark datasets demonstrate that AgentCDM achieves state-of-the-art performance and exhibits strong generalization, validating its effectiveness in improving the quality and robustness of collaborative decisions in MAS. We release our code at https://anonymous.4open.science/status/agent_cdm-5931.

AgentCDM: Enhancing Multi-Agent Collaborative Decision-Making via ACH-Inspired Structured Reasoning

Large language models (LLMs) suffer from a lack of decision-making transparency, limiting their deployment in high-stakes domains such as healthcare. We propose a mechanistic interpretability framework that introduces two novel paradigms: Medical Fine-Tuning with Frozen Attention Layers (FTFA) and Posterior Adaptation Transcoders (PAT). FTFA freezes attention layers while fine-tuning only feed-forward network (FFN) parameters, enabling PAT to efficiently adapt pre-trained transcoders on the same data. This approach achieves over 1000× efficiency improvement compared to training transcoders from scratch. We theoretically justify this methodology and demonstrate its cost-effectiveness for cross-domain transfer. Transcoders are sparse autoencoders that replace MLP layers to provide interpretable feature representations. By substituting MLP layers of both base Gemma2-2b and its medical fine-tuned variant with per-layer transcoders, we enable feature-level attribution analysis. Through systematic pruning and node merging of resulting attribution graphs, we construct human-interpretable decision pathways. Our analysis reveals that LLMs employ two parallel mechanisms for medical diagnosis: pattern matching and multi-hop reasoning, with fine-tuned models demonstrating enhanced correct reasoning patterns. This work provides a practical framework for training transcoders on fine-tuned models at minimal cost, enabling broader application of mechanistic interpretability across domains and potentially guiding model training through transcoder-based analysis.

Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models
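The FTFA idea of freezing attention while fine-tuning only feed-forward parameters amounts to a filter over parameter names. The sketch below is an illustration under assumptions: it uses the `.mlp.` and `.self_attn.` name fragments that Hugging Face Gemma-style checkpoints typically expose, and it reads "only FFN parameters" as also freezing embeddings and normalization weights. Neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
def is_ffn_parameter(name):
    """True if the parameter belongs to a feed-forward (MLP) block.

    Assumes parameter names contain '.mlp.' for FFN weights.
    """
    return ".mlp." in name


def partition_parameters(names):
    """Split parameter names into (trainable, frozen) lists."""
    trainable = [name for name in names if is_ffn_parameter(name)]
    frozen = [name for name in names if not is_ffn_parameter(name)]
    return trainable, frozen


# Hypothetical parameter names in the assumed naming scheme.
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
]

trainable, frozen = partition_parameters(example_names)
print("trainable:", trainable)
print("frozen:", frozen)
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and calling `param.requires_grad_(is_ffn_parameter(name))` on each entry.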

Dataset diversity plays a pivotal role in the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. 
Despite increasing recognition of its importance, dataset diversity has rarely been analyzed systematically. 
To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the $\textit{instruction}$ component, operating at either $\underline{macroscopic}$ (entire instruction semantics) or $\underline{mesoscopic}$ levels (instruction units), and furthermore introduces a novel analysis of $\underline{microscopic}$ diversity within the $\textit{response}$ component, specifically analyzing the statistical distribution of **tokens** in SFT training samples.
In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. 
We then fine-tune LLMs on these datasets to assess the six diversity-control strategies.
Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies.
These findings offer actionable insights for constructing high-performance SFT datasets.

From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning
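One simple way to quantify token-level ("microscopic") diversity in responses is the Shannon entropy of the token frequency distribution. This is a hedged sketch: the abstract says only that the analysis concerns the statistical distribution of tokens, so the choice of entropy as the measure, the whitespace tokenizer, and the function name `token_entropy` are all assumptions. A real pipeline would likely use the model's own tokenizer.

```python
import math
from collections import Counter


def token_entropy(responses):
    """Shannon entropy (in bits) of the token distribution over all responses.

    Tokens are obtained by whitespace splitting, which is a simplification.
    """
    counts = Counter(token for response in responses
                     for token in response.split())
    total = sum(counts.values())
    return -sum((count / total) * math.log2(count / total)
                for count in counts.values())


# Hypothetical response sets of equal size.
repetitive = ["yes yes yes yes", "yes yes yes yes"]
varied = ["sort the list first", "then merge both halves"]

print("repetitive:", token_entropy(repetitive))
print("varied:", token_entropy(varied))
```

Under this measure, a fixed-size dataset could be assembled by preferring samples that raise the entropy of the pooled responses, although the abstract does not say the paper selects data this way.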

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. Existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
