Singapore

Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. As a result, conclusions drawn from contaminated benchmarks on the Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the performance differences observed between the MATH-500 and RandomCalculation benchmarks. We therefore recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.
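The idea of a leakage-free arithmetic generator with different reward signals can be illustrated with a minimal sketch. This is not the authors' RandomCalculation implementation, which is not described on this page. The operator set, the operand range, the left-nested expression shape, and every name below (`make_problem`, `correct_reward`, `random_reward`, `incorrect_reward`) are assumptions chosen for illustration.

```python
import random
from fractions import Fraction

# Assumed operator set; the real generator may use a different one.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def make_problem(num_steps, rng, max_operand=100):
    """Build a left-nested expression with `num_steps` operations.

    Returns (expression_text, exact_answer). The answer is tracked as a
    Fraction so the ground truth is exact rather than a rounded float.
    """
    value = Fraction(rng.randint(1, max_operand))
    text = str(value)
    for _ in range(num_steps):
        op = rng.choice(list(OPS))
        operand = rng.randint(1, max_operand)  # never zero, so division is safe
        value = OPS[op](value, Fraction(operand))
        text = f"({text} {op} {operand})"
    return text, value


def correct_reward(prediction, answer, tol=1e-6):
    """1.0 when the numeric prediction matches the exact answer, else 0.0."""
    return 1.0 if abs(prediction - float(answer)) <= tol else 0.0


def random_reward(rng):
    """A reward that ignores the prediction entirely (fair coin flip)."""
    return 1.0 if rng.random() < 0.5 else 0.0


def incorrect_reward(prediction, answer, tol=1e-6):
    """An inverted signal: rewards only predictions that are wrong."""
    return 1.0 - correct_reward(prediction, answer, tol)


if __name__ == "__main__":
    rng = random.Random(0)
    expression, answer = make_problem(num_steps=5, rng=rng)
    print(expression, "=", answer)
    print("correct reward for the true answer:", correct_reward(float(answer), answer))
```

Because each problem is sampled fresh from a seeded random generator, it is very unlikely to appear verbatim in a pre-training corpus, and `num_steps` gives a simple difficulty knob. The three reward functions correspond to the accurate, random, and incorrect signals that the abstract compares.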

AAAI 2026

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

math reasoning

data contamination

large language models

reinforcement learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.




To protect clients' right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies have mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard's fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard makes the data unlearning process 1.3-6.2 times faster than retraining from scratch and 4.9 times faster than the state-of-the-art exact unlearning methods.

FedShard: Federated Unlearning with Efficiency Fairness and Performance Fairness

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. 

To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas:
1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions;
2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible.

Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility. Code and related materials will be made available.

VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness across various applications, yet their computational complexity poses significant scalability challenges. In contrast, structure-agnostic Multi-Layer Perceptrons (MLPs) offer computational efficiency and scalability but traditionally struggle with explicit graph data. To leverage the strengths of both, GNN-to-MLP Knowledge Distillation (KD) methods transfer relational inductive biases from GNNs to MLPs, equipping MLPs with graph-aware capabilities rivaling or even surpassing their teacher GNNs. In this paper, we theoretically answer how knowledge distillation unlocks MLPs’ potential for graph tasks from the perspective of training dynamics, demonstrating that label alignment in KD fundamentally reshapes the Neural Tangent Kernel (NTK) matrix of student MLPs to enable them to learn the teacher model's implicit graph bias. We further investigate finer-grained distillation paradigms, and reveal that conventional layer-wise output alignment fails to effectively align deep-layer graph propagation outcomes. To address this, we propose Dual-Stream Aligned MLP (DA-MLP), which incorporates complementary graph filters in a dual-stream architecture to simultaneously enhance feature space dimensionality for improved representation alignment while preserving graph signals across different frequency bands. Comprehensive experiments on seven benchmark datasets validate that DA-MLP can be seamlessly integrated into existing knowledge distillation frameworks and consistently demonstrates performance enhancements in both transductive and inductive settings.

Demystifying GNN-to-MLP Knowledge Transfer: Theoretical Grounding and Dual-Stream Distillation Method

We present Flow-Induced Diagonal Gaussian Processes (FiD-GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network’s weight uncertainty into a lower-dimensional subspace. Critically, FiD-GP relies on normalising flow variational posterior and spectral regularisations to augment its expressiveness and align the inducing subspace with feature-gradient geometry through a numerically stable projection mechanism objective. Furthermore, we demonstrate how the prediction framework in FiD-GP can help to design a single pass projection for Out-of-Distribution (OoD) detection. Our analysis shows that FiD-GP improves uncertainty estimation ability on various tasks compared with SVGP-based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network’s storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and Out-of-Distribution detection benchmarks, it significantly cuts Bayesian training cost, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state-of-the-art accuracy and uncertainty estimation.

Flow-Induced Diagonal Gaussian Processes

Medical language models face critical barriers to real-world clinical reasoning applications. 
However, mainstream efforts, which fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, are still far from a versatile, credible and efficient language model for clinical reasoning usage.
To this end, we propose MedS$^3$, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. 
Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. 
Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. 
Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct.
Experiments on eleven benchmarks show that MedS$^3$ outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points.
Additional empirical analysis further demonstrates that MedS$^3$ achieves robust and faithful reasoning behavior.

MedS³: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

In semi‑supervised multi‑view classification (SMVC), scarce labels and noisy unlabeled data impair feature aggregation and compromise prediction reliability, while existing methods lack principled guidance and interpretability. To overcome these limitations, we propose a novel unified SMVC framework, Neural Collapse Priors Driven Trust Semi-Supervised Multi-View Classification (NCPD-TSMVC), built upon neural collapse–derived prototype priors and evidential opinion fusion. Concretely, we rigorously prove under neural collapse theory that normalized classifier weights from the labeled‑data pre‑training stage coincide with class centroids in feature space, conferring maximal inter‑class separation and optimal within‑class compactness. These prototype priors permeate the entire learning pipeline, calibrating the representation learning of unlabeled samples to obtain highly discriminative embeddings. Simultaneously, our evidential learning module quantifies epistemic uncertainty and fuses view‑level opinions at the evidence level, yielding robust and transparent decision making. Extensive evaluations across diverse benchmarks demonstrate that NCPD‑TSMVC surpasses state‑of‑the‑art SMVC approaches in performance, robustness and interpretability.

Neural Collapse Priors Driven Trust Semi-Supervised Multi-View Classification

Temporal Action Detection (TAD) aims to identify specific actions in long, untrimmed videos by determining their start, end times and categories, yet existing models suffer from performance degradation under out-of-distribution scenarios due to unrealistic i.i.d. assumptions. While domain generalization (DG) offers a promising solution, image-based DG methods fail to address the unique spatiotemporal challenges in video-based TAD, including the spatiotemporal complexities and significant variations in action instance scales and densities across domains. To bridge this gap, we propose the first DG framework tailored for TAD. We propose Scene-Aware Video Segmentation, which segments videos based on semantic similarity, addressing cross-domain action instance density and scale discrepancies. Additionally, we present Temporal-Aware Normalization Perturbation to generate diverse video features while preserving temporal integrity. We establish the first DG-TAD benchmark, evaluating 11 state-of-the-art DG methods across four datasets. The experiments demonstrate that our framework consistently outperforms existing approaches, achieving superior generalization on unseen domains. The proposed modules are architecture-agnostic, offering plug-and-play compatibility for broader video understanding tasks.

Scene-Aware Spatiotemporal Generalization: Towards Robust Temporal Action Detection Across Domains

Graphs effectively model interactions in real-world applications such as social and trade networks, where Graph Neural Networks (GNNs) excel at tasks such as link prediction to enhance user experiences. Despite these benefits, users raise privacy concerns as user data can be exploited to improve GNN performance without consent. Accordingly, various graph unlearning methods have been developed. Prior work shows that comparing models before and after unlearning enables attackers to launch former membership inference attacks (FMIA) on unlearned data. However, the imprint of unlearned data left in the unlearned model itself remains underexplored, and existing membership inference methods mainly exploit overfitting, making them ineffective for identifying unlearned data. To address this, we conduct a theoretical analysis and propose an attack framework targeting unlearned GNNs by learning the distribution patterns of unlearned data to distinguish them from normal test data. Extensive experiments on four real-world datasets and GNN architectures confirm our framework's effectiveness and reveal significant vulnerabilities in current graph unlearning methods.

Imprint of the Forgotten: Stealthy Membership Inference in Unlearned Graph Neural Networks

Federated Recommendation (FR) is a distributed framework for training recommendation models, which enhances privacy by sharing model parameters instead of raw data. However, the large number of parameters, primarily due to the massive item embeddings, significantly hampers communication efficiency. While existing studies mainly focus on improving the efficiency of FR models, they largely overlook the issue of embedding parameter overhead. To address this gap, we propose an FR training framework with Parameter Efficient Fine Tuning (PEFT) based embedding designed to reduce the volume of embedding parameters that need to be transmitted. Our approach offers a lightweight, plugin-style solution that can be seamlessly integrated into existing FR methods. In addition to incorporating common PEFT techniques such as LoRA and Hash-based encoding, we explore the use of Residual Quantized Variational Autoencoders (RQ-VAE) as a novel PEFT strategy within our framework. Extensive experiments across various FR model backbones and datasets demonstrate that our framework significantly reduces communication overhead while improving accuracy.

Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation

Customized text-to-video generation (CTVG) has recently witnessed significant progress in generating tailored videos from user-specific text. However, existing CTVG methods unrealistically assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with catastrophic forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling catastrophic forgetting and concept neglect. Specifically, to address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Furthermore, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention and attention-guided noise estimation. Experimental comparisons demonstrate that our CCVD model outperforms existing CTVG models.
