Singapore

Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning utilizing rule-based rewards. However, the explicit reasoning process has not yet yielded substantial benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of achieving human-level auditory-language reasoning.
To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs through improved adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that assist the model in distinguishing between valid and flawed reasoning paths during training. Experimental results demonstrate that Audio-Thinker models outperform existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.

AAAI 2026

Audio-Thinker: Guiding Large Audio Language Model When and How to Think via Reinforcement Learning

nlp: speech

ml: reinforcement learning

ml: active learning

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

We precisely characterize the expressivity of computable Recurrent Graph Neural Networks (recurrent GNNs). We prove that recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation, can compute any graph algorithm that respects the natural message-passing invariance induced by the Color Refinement (or Weisfeiler-Leman) algorithm. While it is well known that the expressive power of GNNs is limited by this invariance [Morris et al., AAAI 2019; Xu et al., ICLR 2019], we establish that recurrent GNNs can actually match this limit. This is in contrast to non-recurrent GNNs, which have the power of Weisfeiler-Leman only in a very weak, "non-uniform", sense where each graph size requires a different GNN to compute with. Our construction introduces only a polynomial overhead in both time and space.
Furthermore, we show that by incorporating random initialization, for connected graphs recurrent GNNs can express all graph algorithms. In particular, any polynomial-time graph algorithm can be emulated on connected graphs in polynomial time by a recurrent GNN with random initialization.

Repetition Makes Perfect: Recurrent Graph Neural Networks Match Message Passing Limit

Multimodal Malicious Content Detection (MMCD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, malicious content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in MMCD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for MMCD. Motivated by the insight that, despite the evolving nature of malicious manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving malicious content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. Experiments show that SCANNER outperforms all baselines, with an average gain of 6.26% in Macro-F1 over the best.

Shedding the Facades, Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation

Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While GDBA provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in our DGLS. Our extensive empirical results on various standard benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Particularly, compared to Damped Max-sum with high damping factors (e.g., 0.7 or 0.9), our DGLS achieves competitive performance on general-valued problems, and outperforms it by significant margins (\textbf{3.77\%--66.3\%}) on structured problems in terms of anytime results.

GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization

In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for MAE criteria. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.

Binary Split Categorical Feature with Mean Absolute Error Criteria in CART

Large Language Models show promise in emotion understanding, social reasoning, and empathy, yet they struggle with psychologically grounded tasks that require inferring implicit mental states in context-rich, ambiguous settings. These limitations arise from the absence of theory-aligned supervision and the difficulty of capturing nuanced mental processes in real-world narratives. To address this gap, we leverage expert-labeled, psychologically rich scenarios and propose a trajectory-aware reinforcement learning framework that explicitly imitates expert psychological thought patterns. By integrating real-world stimuli with structured reasoning guidance, our approach enables compact models to internalize social-cognitive principles, perform nuanced psychological inference, and support continual self-improvement. Comprehensive experiments across multiple benchmarks further demonstrate that our models achieve expert-level interpretive capabilities, exhibiting strong out-of-distribution generalization and robust continual learning across diverse, challenging, and psychologically grounded tasks.

From Stimuli to Minds: Enhancing Psychological Reasoning in LLMs via Bilateral Reinforcement Learning

Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in the context of cross-sample and cross-platform/batch settings for accomplishing the CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms existing eight state-of-the-art methods in CTR detection. Code availability and reproducibility details are in the Supplementary Materials.

SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

Accurately determining the shape of deformable objects and
the location of their internal structures is crucial for
medical tasks that require precise targeting, such as
robotic biopsies.
We introduce LUDO, a method for accurate low-latency
understanding of deformable objects.
LUDO reconstructs objects in their deformed state,
including their internal structures, from a single-view
point cloud observation in under 30 ms using occupancy
networks.
LUDO provides uncertainty estimates for its predictions.
Additionally, it provides explainability by highlighting
key features in its input observations.
Both uncertainty and explainability are important for
safety-critical applications such as surgery.
We evaluate LUDO in real-world robotic experiments,
achieving a success rate of 98.9% for puncturing various
regions of interest (ROIs) inside deformable objects.
We compare LUDO to a popular baseline and show its superior
ROI localization accuracy, training time, and memory
requirements.
LUDO demonstrates the potential to interact with deformable
objects without the need for deformable registration
methods.

LUDO: Low-latency understanding of deformable objects using
point cloud occupancy functions

The rapid advancements in artificial intelligence (AI) have primarily focused on the process of learning from data to acquire knowledgeable learning systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data removal tasks, and has emerged as a transformative paradigm shift in the field of AI. It refers to the ability of a model to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors using different experimental settings, which can lead to overly optimistic and potentially unfair assessments that may disproportionately favor one particular attack over the others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our organized codebase and novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.

Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs—manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose \textbf{Reinforcement Mixing Learning (RMLer)}, a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer’s superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.

RMLer: Synthesizing Novel Objects Across Diverse Categories via Reinforcement Mixing Learning

Sketch person re-identification aims to match hand-drawn sketches with RGB surveillance images, yet remains challenging due to severe modality discrepancies and limited labeled data. To address this, we propose KTCAA, a theoretically grounded and interpretable framework for few-shot cross-modal transfer learning. From the perspective of generalization bounds, we identify two key controllable factors essential to minimizing target domain error: (1) domain discrepancy, which reflects the difficulty of aligning source and target distributions in the feature space; and (2) perturbation invariance, which measures the model’s robustness to cross-modal variations. To address these challenges, we design two corresponding modules: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target modality distributions, introducing slight but meaningful feature shifts that guide gradual distribution alignment during training; and (2) Knowledge Transfer Catalyst (KTC), which enhances perturbation invariance by generating worst-case adversarial modality perturbations and enforcing output consistency under such perturbations. These modules are jointly optimized within a meta-learning framework that transfers alignment knowledge from RGB-rich domains to sketch scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly under limited data and cross-domain transfer settings.

Downloads

Next from AAAI 2026

Repetition Makes Perfect: Recurrent Graph Neural Networks Match Message Passing Limit

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES