Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by training directly on offline preference data to align them with human preferences. During DPO training, the reference model serves as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the absence of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data. Our code, data, and technical appendix can be found in the Supplementary Material.
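As a point of reference for the weighting role of the reference model described above, the standard DPO objective can be written as follows. This is a minimal sketch in the usual DPO notation (y_w and y_l the preferred and dispreferred responses, D the preference dataset, beta the KL-regularization strength, sigma the logistic function); it restates the well-known DPO loss and is not taken from the abstract itself:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Read against this objective, Pre-DPO can be understood as substituting a guiding reference \pi_{\mathrm{guide}} for \pi_{\mathrm{ref}}: because the per-sample gradient weight of the DPO loss depends on the reference model's log-probability margin between y_w and y_l, a guiding reference that anticipates the desired policy state up-weights samples it already handles well and down-weights the rest. How \pi_{\mathrm{guide}} is obtained (for example, by a preliminary optimization pass over the same preference data) is an assumption of this sketch; the paper and Supplementary Material give the exact procedure.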