Large Language Models (LLMs) have achieved remarkable success in instruction-following and dialogue tasks, yet aligning them with human preferences remains a critical challenge. Recent advances such as Direct Preference Optimization (DPO) simplify the alignment pipeline by bypassing explicit reward modeling, but they often suffer from suboptimal reward margin distributions, leading to weak supervision signals and reduced discriminative capacity. In this work, we propose Reward Margin Optimization (RMO), a framework that reshapes reward margin distributions during training to improve alignment performance. RMO comprises three components: (1) a Dual Denoising Filtering strategy that filters ambiguous and noisy preference pairs based on reward margin dynamics; (2) Batch Margin Diversification, which maximizes intra-batch margin variance to enhance the diversity of the learning signal; and (3) Pairwise Margin Amplification, an auxiliary regularization term that encourages larger margins between preferred and dispreferred responses. Extensive experiments on multiple LLMs and datasets demonstrate that RMO consistently improves win rates over strong baselines such as DPO and SimPO, while remaining compatible with various preference-based optimization methods. Our results highlight the central role of the reward margin distribution in preference alignment and establish RMO as an effective and scalable enhancement to existing alignment techniques.
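To make the three components concrete, the following is a minimal PyTorch sketch of how a reward-margin-shaping loss of this kind could be assembled on top of a DPO-style objective. The specific filtering rule, variance bonus, amplification term, and all names and hyperparameters shown here (`rmo_loss`, `tau_low`, `tau_high`, `lambda_var`, `lambda_amp`) are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F


def rmo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, tau_low=0.05, tau_high=5.0,
             lambda_var=0.01, lambda_amp=0.05):
    """Illustrative sketch of a reward-margin-shaping DPO variant (assumed form)."""
    # Implicit DPO rewards and per-pair reward margins.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards

    # (1) Dual denoising filtering (assumed rule): drop pairs whose margin is
    # near zero (ambiguous) or extremely large (likely noisy or already solved).
    keep = (margins.abs() > tau_low) & (margins.abs() < tau_high)
    if keep.sum() == 0:  # fall back to the full batch if everything is filtered
        keep = torch.ones_like(margins, dtype=torch.bool)
    kept_margins = margins[keep]

    # Standard DPO term on the retained pairs.
    dpo_term = -F.logsigmoid(kept_margins).mean()

    # (2) Batch margin diversification (assumed form): reward intra-batch
    # margin variance to diversify the learning signal.
    var_term = -kept_margins.var(unbiased=False)

    # (3) Pairwise margin amplification (assumed form): hinge-style push
    # toward larger margins between preferred and dispreferred responses.
    amp_term = F.relu(1.0 - kept_margins).mean()

    return dpo_term + lambda_var * var_term + lambda_amp * amp_term
```

A training loop would call such a function on the per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, exactly where a standard DPO loss would otherwise be computed, which is what keeps this kind of margin shaping compatible with other preference-based objectives.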