Singapore

AAAI 2026

Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

perplexity-aware reinforcement learning

evaluation preference

critiquing capability

mathematical reasoning

large language models

To improve the Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the MsMR reasoning process and rendering a final verdict on each problem-solution pair. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability and pay little attention to the underlying reason for the poor critiquing performance of LLMs. In this paper, we take an orthogonal approach: we quantify and investigate a potential reason, imbalanced evaluation preference, through a statistical preference analysis. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm that rectifies the evaluation preference and thereby improves critiquing capability. Specifically, to probe LLMs' critiquing characteristics, we construct a One-to-many Problem-Solution (OPS) benchmark that quantifies how an LLM's behavior differs when it evaluates problem solutions generated by itself versus by other models. Then, to investigate this behavior difference in depth, we conduct a perplexity-oriented statistical preference analysis and find an intriguing phenomenon: "LLMs incline to judge solutions with lower perplexity as correct", which we dub imbalanced evaluation preference. To rectify this preference, we use perplexity as the baton in Group Relative Policy Optimization, encouraging the LLMs to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
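The two standard ingredients named in the abstract, perplexity and group-relative advantages, can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: `perplexity` and `group_relative_advantages` follow the usual textbook definitions, while `perplexity_aware_reward`, its `bonus` parameter, and the median split are hypothetical choices made here only to show what "rewarding judgments that go against the perplexity bias" could look like.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (population std used here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def perplexity_aware_reward(verdict_correct, judged_as_correct, ppl, ppl_median, bonus=0.5):
    """HYPOTHETICAL shaping rule (an assumption, not taken from the paper).

    A correct verdict earns 1.0. It earns an extra `bonus` when it goes against
    the observed bias, i.e. a high-perplexity solution judged correct or a
    low-perplexity solution judged wrong.
    """
    base = 1.0 if verdict_correct else 0.0
    against_bias = (judged_as_correct and ppl > ppl_median) or (
        not judged_as_correct and ppl < ppl_median
    )
    return base + (bonus if verdict_correct and against_bias else 0.0)
```

For instance, a critique that correctly rejects a fluent (low-perplexity) but wrong solution would receive `1.0 + bonus` under this toy rule, so it stands out within its group once the rewards are standardized.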

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image otherwise unchanged. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses both NSFW-ness and text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.

Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Multi-view clustering (MVC) aims to enhance clustering performance by integrating complementary information from diverse sources. Existing deep MVC methods often face a trade-off between learning shared consensus representations and preserving view-specific characteristics: they either employ separate encoders that limit collaboration or rely on a single shared encoder at the expense of diversity. Recently, Mixture-of-Experts (MoE) models have been introduced to MVC to facilitate cooperation, but their flattened expert pool design leads to entangled shared and specific information, while their routing mechanism overlooks valuable cross-view context. To address these challenges, we propose a novel framework, Decoupled Mixture-of-Experts with Context-Aware Routing (DMCAR). First, we design a Decoupled MoE (D-MoE) architecture comprising a public expert pool for learning shared representations and private expert pools for capturing unique information from each view, structurally enforcing representation decoupling. Second, we introduce a Context-Aware Hierarchical Routing (CAHR) mechanism that leverages a global context vector to guide routing decisions when selecting experts from the shared pool, enabling more intelligent cross-view collaboration. Finally, we adopt a multi-level contrastive learning paradigm, enforcing semantic consistency in shared representations through a cross-view alignment loss while promoting decoupling between shared and specific representations via orthogonality constraints. Extensive experiments on multiple benchmark datasets demonstrate that DMCAR significantly outperforms state-of-the-art methods across various clustering metrics and validate the effectiveness of each component in our framework.

DMCAR: Disentangled Mixture-of-Experts with Context-Aware Routing for Multi-View Clustering

Partially observable Markov decision processes (POMDPs) are a central model for uncertainty in sequential decision making. 
The most basic objective is the reachability objective, where a target set must be eventually visited, and the more general parity objectives can model all $\omega$-regular specifications.
For such objectives, the computational analysis problems are the following: 
(a) qualitative analysis that asks whether the objective can be satisfied with probability $1$ (almost-sure winning) or probability arbitrarily close to $1$ (limit-sure winning);
and (b) quantitative analysis that asks for the approximation of the optimal probability of satisfying the objective.
For general POMDPs, almost-sure analysis for reachability objectives is EXPTIME-complete, but limit-sure and quantitative analyses for reachability objectives are undecidable; almost-sure, limit-sure, and quantitative analyses for parity objectives are all undecidable.
A special class of POMDPs, called revealing POMDPs, has been studied recently in several works, and for this subclass the almost-sure analysis for parity objectives was shown to be EXPTIME-complete.
In this work, we show that for revealing POMDPs the limit-sure analysis for parity objectives is EXPTIME-complete, and even the quantitative analysis for parity objectives can be achieved in EXPTIME.
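For readers new to the terminology, the three analysis problems can be written compactly as follows. The notation is an assumption introduced here for illustration and does not come from the abstract: $\sigma$ ranges over policies, $\varphi$ is the objective, and $\mathbb{P}^{\sigma}(\varphi)$ is the probability of satisfying $\varphi$ under $\sigma$.

$$
\begin{aligned}
\text{almost-sure:} &\quad \exists \sigma.\ \mathbb{P}^{\sigma}(\varphi) = 1 \\
\text{limit-sure:} &\quad \sup_{\sigma} \mathbb{P}^{\sigma}(\varphi) = 1 \\
\text{quantitative:} &\quad \text{given } \varepsilon > 0,\ \text{compute } v \text{ with } \left| v - \sup_{\sigma} \mathbb{P}^{\sigma}(\varphi) \right| \le \varepsilon
\end{aligned}
$$

Almost-sure winning implies limit-sure winning, but not conversely, since the supremum need not be attained by any single policy.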

Revealing POMDPs: Qualitative and Quantitative Analysis for Parity Objectives

Deep hashing offers efficient storage and fast retrieval capabilities. As a result, it has been extensively applied to large‑scale retrieval tasks. To alleviate the dependence on high-quality annotated data, recent research has focused on unsupervised domain adaptive hashing methods, which aim to transfer knowledge from a label-rich source domain to a label-scarce target domain. However, in open-world scenarios, source domain labels are often inevitably noisy, which tends to undermine the quality of learned hash codes and induce considerable performance deterioration. To this end, we introduce a novel Robust Domain Adaptive Hashing (RDAH) method to jointly mitigate the adverse effects of label noise and domain discrepancy. Specifically, we first model the loss distribution of training samples using a two-component Gaussian mixture model to estimate each sample’s confidence, based on which the data is partitioned. Subsequently, we introduce a neighbor consistency-guided correction strategy, which leverages the semantic structure of high-confidence neighbors to perform weighted correction on noisy samples. Moreover, we design a dual-level cross-domain alignment mechanism that jointly mitigates domain shift from two complementary perspectives. Extensive experimental results validate the effectiveness and robustness of RDAH across multiple benchmark datasets.
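The confidence-estimation step described above (a two-component Gaussian mixture fitted to per-sample losses) can be sketched with a small EM loop. This is a generic illustration under stated assumptions, not the RDAH implementation: the function names, the initialization at the extreme losses, the variance floor, and the reading of the lower-mean component as the "clean" one are all choices made here.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean `mu` and variance `var`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, mus, variances):
    """Posterior probability of each of the two components for one loss value."""
    p = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(2)]
    total = p[0] + p[1]
    return [p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5]


def fit_two_component_gmm(losses, iters=50, var_floor=1e-6):
    """EM for a 1-D, two-component Gaussian mixture over per-sample losses."""
    n = len(losses)
    avg = sum(losses) / n
    spread = max(sum((x - avg) ** 2 for x in losses) / n, var_floor)
    weights, mus, variances = [0.5, 0.5], [min(losses), max(losses)], [spread, spread]
    for _ in range(iters):
        # E-step: soft-assign every loss to the two components.
        resp = [responsibilities(x, weights, mus, variances) for x in losses]
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component collapsed; keep its previous parameters
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            variances[k] = max(
                sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, losses)) / nk,
                var_floor,
            )
    return weights, mus, variances


def clean_confidence(losses, weights, mus, variances):
    """Posterior of the lower-mean (small-loss) component, used as confidence."""
    k_clean = 0 if mus[0] <= mus[1] else 1
    return [responsibilities(x, weights, mus, variances)[k_clean] for x in losses]
```

Samples could then be partitioned by thresholding the returned confidences, for example at 0.5; the threshold is likewise an assumption.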

Robust Domain Adaptive Hashing via Structural Noise Modeling and Correction

In some professional sports leagues, inter-league games are scheduled among multiple divisions or conferences. This inspired us to study the $p$-partite Traveling Tournament Problem ($p$-partite TTP), where teams are partitioned into $p$ leagues, and each team plays games against teams from different leagues. Previously, only the case of $p=2$, known as the Bipartite TTP or BTTP, has been introduced and studied (AAAI 2011 and IJCAI 2024).

In this paper, we show that the $p$-partite TTP is NP-hard for any fixed $p \geq 3$, and we propose an efficient algorithm based on a solution to the TSP. Furthermore, we prove that the algorithm achieves a novel approximation ratio of $\frac{8}{3} + O(\frac{1}{n})$ when $p=3$. We also conduct experiments demonstrating that the algorithm produces practical schedules with significantly reduced total travel distances, highlighting its effectiveness in generating high-quality multipartite tournament schedules.

A TSP-Based Algorithm for Multi-League Traveling Tournament

Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning.
Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases susceptibility to potential attacks.
Motivated by one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive.
However, current one-shot approaches face several challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) a lack of specialized adaptation strategies to systematically handle severe data heterogeneity; and (3) the need for additional training resources on the client or server side.
To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. 
To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations.
In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity.
Our method is training-free, not relying on additional training resources on either the client or server side.
Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

Dynamic graphs are common in real‑world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the full complexity of temporal evolution. 
They tend to overlook fine-grained variations in interaction order, struggle with dependencies that span long time horizons, and provide limited modeling of pair-specific relational dynamics. To address these challenges, we propose Graph2Video, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of “graph frames”. By stacking temporally ordered subgraph frames into a “graph video”, Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight, plug-and-play, link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight that borrowing spatio-temporal modeling techniques from computer vision provides a principled and effective avenue for advancing dynamic graph learning.
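The idea of stacking temporally ordered subgraph frames can be made concrete with a toy construction. The abstract does not specify how frames are built, so everything below is a hypothetical illustration: the function name `build_graph_video`, the equal-size chunking of time-sorted events, and the use of undirected adjacency-count matrices as frames are assumptions.

```python
def build_graph_video(events, nodes, num_frames):
    """Turn timestamped interactions into a stack of adjacency "frames".

    events: iterable of (u, v, t) interactions.
    nodes: the local node set around the target link (defines matrix order).
    Returns a list of `num_frames` square matrices, ordered in time; entry
    [i][j] counts interactions between nodes i and j within that frame.
    """
    index = {node: i for i, node in enumerate(nodes)}
    # Keep only events inside the neighborhood and sort them by timestamp.
    ordered = sorted(
        (e for e in events if e[0] in index and e[1] in index),
        key=lambda e: e[2],
    )
    n = len(nodes)
    frames = [[[0] * n for _ in range(n)] for _ in range(num_frames)]
    for pos, (u, v, _) in enumerate(ordered):
        frame = pos * num_frames // len(ordered)  # equal-size chunks in time order
        i, j = index[u], index[v]
        frames[frame][i][j] += 1
        if i != j:
            frames[frame][j][i] += 1  # treat interactions as undirected
    return frames
```

The resulting stack has shape (frames, nodes, nodes), which is the kind of frame sequence a video-style encoder consumes; how Graph2Video actually encodes it is not covered by this sketch.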

Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems.
In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS’s resilience, based on which we train a topology generator to automatically design resilient topologies for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent’s prompt based on its connections and interactions with other agents.
Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, ASR and TTS are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.

WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation

For artificial intelligence to be safely deployed in high-risk domains, it must reliably know its limits. Selective prediction, or learning with a reject option, addresses this by enabling a model to abstain from prediction on inputs it deems unreliable, deferring them to a human expert. While deep ensembles have emerged as a leading approach for uncertainty estimation, their potential is often squandered by rejection methods that rely on static thresholds applied to the mean prediction. In this paper, we propose to learn a dynamic rejection policy directly from the rich behavioral signals of the ensemble itself. Our framework, DEGRE (Dynamic Ensembles Gating for REjection), is a novel meta-learning approach that trains a lightweight gating network on the ensemble’s consensus confidence and its internal disagreement (variance) to explicitly discriminate between correct and incorrect predictions. Through rigorous evaluation across twelve diverse medical imaging benchmarks (MRI, X-ray, CT), DEGRE significantly advances selective prediction, achieving an average risk-coverage (AURC) reduction of 68.2% compared to the standard ensemble baseline. By providing a more reliable method for a model to recognize its own limitations, this learned, adaptive rejection mechanism provides the robust self-awareness necessary for true AI-in-the-loop (AI2L) systems, paving the way for the safe and responsible integration of AI into critical clinical workflows.
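The two signals the gating network consumes, and the metric used to score rejection, can be sketched as follows. This is an illustration under assumptions, not the DEGRE code: `ensemble_features` computes consensus confidence as the mean probability of the predicted class and disagreement as the variance of that probability across members, which is one plausible reading of the abstract, and `aurc` uses the common discrete definition of the area under the risk-coverage curve.

```python
from statistics import mean, pvariance


def ensemble_features(member_probs):
    """Summarize one input's ensemble output.

    member_probs: one class-probability list per ensemble member.
    Returns (predicted class, consensus confidence, disagreement).
    """
    num_classes = len(member_probs[0])
    mean_probs = [mean(p[c] for p in member_probs) for c in range(num_classes)]
    pred = max(range(num_classes), key=mean_probs.__getitem__)
    confidence = mean_probs[pred]
    # Variance across members of the probability given to the predicted class.
    disagreement = pvariance([p[pred] for p in member_probs])
    return pred, confidence, disagreement


def aurc(scores, errors):
    """Area under the risk-coverage curve (lower is better).

    scores: acceptance score per sample (higher means accept first).
    errors: 1 if the prediction is wrong, 0 if it is right.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cumulative_errors, risks = 0, []
    for covered, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        risks.append(cumulative_errors / covered)  # selective risk at this coverage
    return sum(risks) / len(risks)
```

A static-threshold baseline would pass the mean confidence as `scores`; a learned gate would instead pass its own output computed from the confidence and disagreement features, and the two can be compared by their `aurc` values.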
