Singapore

Vision-Language Models (VLMs) excel at extracting salient visual features from query images and therefore show promising visual recognition performance. However, their performance degrades significantly in fine-grained scenarios because they struggle to distinguish nuanced differences among candidate categories. As a remedy, we draw inspiration from the "System 1 & System 2" theory of human cognition to enable fine-grained recognition with VLMs. Specifically, we observe that VLMs naturally align with System 1: they quickly identify candidate categories but leave easily confused ones unresolved. Based on this observation, we propose System-2 enhanCed visuAl recogNition (SCAN), a novel plug-and-play approach that makes VLMs aware of these nuanced differences. In brief, SCAN first uses off-the-shelf large foundation models to specify the discriminative attributes of the confused candidate categories and to abstract those of the query images. SCAN then adaptively integrates the salient visual features from System 1 with the nuanced differences derived from System 2, using estimated uncertainty to resolve confusion among candidates. Extensive experiments on eight widely used fine-grained recognition benchmarks against 10 state-of-the-art baselines verify the effectiveness and superiority of SCAN. The code will be released upon acceptance.
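As a rough illustration of the uncertainty-based integration described in this abstract, the sketch below blends two score vectors and leans on the second one more heavily when the first is uncertain. The abstract does not specify the fusion rule: the normalized-entropy weight, the function names (`softmax`, `normalized_entropy`, `fuse`), and the idea of representing each system by a score vector are assumptions made purely for illustration, not SCAN's actual method.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Entropy scaled to [0, 1]; values near 1 mean high uncertainty."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def fuse(system1_scores, system2_scores):
    """Blend two score vectors over the same candidate categories.

    Assumption: the weight given to the hypothetical System 2 scores is
    the normalized entropy of the System 1 distribution.
    """
    p1 = softmax(system1_scores)
    p2 = softmax(system2_scores)
    w = normalized_entropy(p1)
    fused = [(1 - w) * a + w * b for a, b in zip(p1, p2)]
    return fused, w
```

With this choice, a confident System 1 prediction is left almost unchanged, while a flat System 1 distribution defers almost entirely to the System 2 scores.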

AAAI 2026

Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition

unsupervised recognition

large foundation models

vision-language models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

The recent success of large language models (LLMs) has sparked growing interest in training large-scale models. As model sizes continue to scale, concerns are growing about the depletion of high-quality, well-curated training data. This has led practitioners to explore training approaches such as Federated Learning (FL), which can leverage the abundant data on edge devices while maintaining privacy. However, the decentralization of training datasets in FL introduces challenges to scaling large models, a topic that remains under-explored. This paper fills this gap and provides qualitative insights on generalizing previous model-scaling experience to federated learning scenarios. Specifically, we derive a PAC-Bayes (Probably Approximately Correct Bayesian) upper bound on the generalization error of models trained with stochastic algorithms in federated settings, and we quantify the impact of distributed training data on the optimal model size by finding the analytic solution for the model size that minimizes this bound. Our theoretical results demonstrate that the optimal model size has a negative power-law relationship with the number of clients if the total training compute is unchanged. We also find that switching to FL with the same training compute inevitably reduces the upper bound on the generalization performance that the model can achieve through training, and that estimates of the optimal model size in federated scenarios should depend on the average training compute across clients. Finally, we empirically validate our results with extensive training runs on different models, network settings, and datasets.
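To make the stated power-law relationship concrete, here is a minimal sketch of how such a rule could be evaluated. The abstract gives neither the exponent nor the constant, so `alpha` and `single_client_size` are hypothetical placeholders, and the function name is illustrative rather than taken from the paper.

```python
def optimal_model_size(single_client_size, num_clients, alpha):
    """Evaluate a negative power law: size(K) = size(1) * K ** (-alpha).

    single_client_size: hypothetical optimal size with one client
    num_clients:        number of clients K sharing a fixed total compute
    alpha:              hypothetical positive exponent (not from the paper)
    """
    if num_clients < 1:
        raise ValueError("num_clients must be at least 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive for a negative power law")
    return single_client_size * num_clients ** (-alpha)

# Illustration only: with alpha = 0.5, quadrupling the clients halves the size.
sizes = [optimal_model_size(1e9, k, 0.5) for k in (1, 4, 16)]
```

Under this form, the estimated optimal size shrinks monotonically as the same total compute is spread over more clients.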

Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size?

Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to capture the subtle emotions expressed in different modalities, the power of explanations and temporal alignment remains underexplored. This paper therefore proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments MSA with explanations generated by Multi-modal Large Language Models (MLLMs), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns the different modalities with the explanations and introduces a new text-routed sparse mixture-of-experts with gate fusion. The temporal alignment block combines the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs: it wins on at least four of the six metrics. For example, TEXT reduces the mean absolute error to 0.353 on the CH-SIMS dataset, a 13.5% reduction compared with recently proposed approaches.
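The general idea of a text-routed sparse mixture-of-experts can be sketched as follows: routing weights are computed from the text features alone, only the top-k experts are kept, and those experts process the multi-modal features. This is a generic top-k gating sketch under that assumption; the router shape, the toy experts, and all names are hypothetical, and the paper's actual gate fusion is not reproduced here.

```python
import math

def top_k_gate(gate_logits, k):
    """Keep the k largest logits and softmax over them; the rest get no weight."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in order)
    exps = {i: math.exp(gate_logits[i] - m) for i in order}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def text_routed_moe(text_feat, fused_feat, router_weights, experts, k=2):
    """Route using text features only, then mix the selected experts' outputs."""
    # One router logit per expert, computed from the text features alone.
    logits = [sum(w * t for w, t in zip(row, text_feat))
              for row in router_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(fused_feat)
    for idx, gate in gates.items():
        expert_out = experts[idx](fused_feat)  # only selected experts run
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, gates

# Toy experts standing in for learned networks (illustration only).
experts = [
    lambda x: [2 * v for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [0.0 for _ in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
output, gates = text_routed_moe([1.0, 1.0], [1.0, 2.0], router_weights, experts)
```

Because the gate depends only on `text_feat`, the same audio and video features can be sent to different experts depending on what the text says.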

A Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis

Gait recognition is emerging as a promising technology and an innovative field within computer vision, with a wide range of applications in remote human identification. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion regions, such as the arms and legs. This bottleneck is particularly challenging in the presence of intra-class variation, where gait features of the same individual under different environmental conditions are significantly distant in the feature space. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LMGait. To the best of our knowledge, LMGait is the first method to introduce natural language descriptions as explicit semantic priors into the gait recognition task. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. To improve cross-modal alignment, we propose the Motion Awareness Module (MAM), which refines the language features by adaptively adjusting various levels of semantic information to ensure better alignment with the visual representations. Furthermore, we introduce the Motion Temporal Capture Module (MTCM) to enhance the discriminative capability of gait features and improve the model’s motion tracking ability. We conducted extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network. Specifically, our model achieved accuracies of 88.5%, 97.1%, and 97.5% on the CCPG, SUSTech1K, and CASIA-B* datasets, respectively, achieving state-of-the-art performance.

Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition

Recent studies reveal that adversaries can manipulate the internal knowledge of large language models (LLMs) on selected topics through model editing, causing attacker-specified harmful or biased outputs when the models are queried about the edited content. Once such tampered LLMs are distributed, they can mislead users on the targeted topics, potentially propagating misinformation or reinforcing stereotypes. However, existing knowledge manipulation attacks rely on the ability to redistribute compromised models, which is infeasible in constrained settings like Federated Instruction Tuning (FedIT), where a central server controls the LLM's training and distribution. In this work, we introduce ShadeEdit, the first attack framework that leverages strengthened model editing to enable knowledge manipulation in FedIT scenarios. ShadeEdit introduces two key components to address two challenges posed by the FedIT training process: (1) a paraphrase-based editing dataset selection strategy, which constructs a high-quality editing dataset to keep benign updates from diluting malicious ones, and (2) an adaptive manipulation mechanism, which evades aggregation-based defenses via an adaptive clipping strategy. ShadeEdit achieves an average 99.5% attack success rate over eight robust aggregation algorithms while preserving instruction-following accuracy, demonstrating its strong attack effectiveness and model-utility preservation. Our code is available at the following anonymous link: https://anonymous.4open.science/r/ShadeEdit-41EA/.

ShadeEdit: A Utility-Preserving and Defense-Evasive Knowledge Manipulation Attack in Federated LLMs

Inspired by the success of large language models (LLMs) in natural language processing, cell language models (CLMs) have emerged as a promising paradigm for learning cell representations from high-dimensional single-cell data, particularly transcriptomic profiles from scRNA-seq. These foundation models have shown remarkable potential across a variety of downstream applications. However, there remains a lack of foundation models for scATAC-seq data, which measures chromatin accessibility at the single-cell level and is critical for decoding epigenetic regulation. Developing such models is considerably more challenging due to the unique characteristics of scATAC-seq data, including the vast number of chromatin regions, lack of standardized annotations, extreme sparsity, and near-binary distributions. To address these challenges, we systematically explore various strategies and propose CLM-access, a specialized foundation model for scATAC-seq data. CLM-access incorporates three main innovations: (1) a unified data processing pipeline that maps 2.8 million cells onto a unified reference of over 1 million chromatin regions; (2) a specialized patching and embedding strategy to effectively manage high-dimensional inputs; and (3) a tailored masking and loss function design that preserves fine-grained regional information while enhancing training efficiency and representation quality. With comprehensive benchmarks, we show that CLM-access significantly outperforms existing methods in key downstream tasks, including batch effect correction, cell type annotation, RNA expression prediction, and multi-modal integration. This work establishes a scalable and interpretable foundation model for single-cell epigenomic analysis and expands the application of CLMs in single-cell research. Code is available at https://github.com/HIM-AIM/CLM-access

CLM-Access: A Specialized Foundation Model for High-Dimensional Single-Cell ATAC-Seq Analysis

Mainstream 3D human pose estimation methods directly predict the 3D coordinates of joints from 2D keypoints and therefore suffer from severe depth ambiguity. Textual descriptions of poses contain abundant semantic information, which helps the model learn the spatial relationships among different body parts and partially alleviates this issue. Leveraging this insight, we propose a 3D human pose estimation method assisted by textual descriptions. Specifically, we use an automatic captioning pipeline to generate textual descriptions of 3D poses based on spatial relations among joints. These descriptions include details regarding angles, distances, relative positions, pitch and roll, and ground contacts. Text features are then extracted from these descriptions using a language model, while a 3D human pose estimation model extracts pose features. Aligning the pose features with the text features allows for a more targeted optimization of the estimation model. We therefore systematically introduce three alignment approaches to effectively align features extracted by two models operating in entirely different domains. Our method incorporates prior knowledge derived from the textual descriptions into the estimation model and can be seamlessly applied to various existing frameworks. Experimental results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our method surpasses state-of-the-art methods.

Learning Knowledge from Textual Descriptions for 3D Human Pose Estimation

Mixture-of-Experts (MoE) architectures have become a cornerstone for scaling large language models (LLMs) efficiently, yet how their sparse structure shapes knowledge acquisition during pre-training remains unknown. Existing interpretability methods predominantly focus on post-hoc analysis of dense models, overlooking the dynamic, architectural differences that define MoE. To bridge this gap, we introduce Gated-LPI, a neuron-level attribution metric that decomposes log-probability increase across neurons. We present the first time-resolved comparison of knowledge acquisition dynamics in MoE versus dense architectures by tracking checkpoints across 1.2M training steps (≈5.2T tokens). Our analysis reveals three key phenomena: (1) Early consolidation. The MoE model locks into a stable importance profile in fewer than 100K steps, whereas the dense model remains volatile throughout training. (2) Low-entropy backbone. Approximately the top 1% of MoE neurons consistently receive more than 45% of positive updates, creating a persistent, high-utility core absent in the dense baseline. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by less than 10%, compared with more than 50% for the dense model, showing that sparsity fosters distributed, rather than brittle, knowledge storage. These phenomena collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training. Together, these findings bridge the gap between sparse architectures and training-time interpretability, offering actionable insights for expert-pruning and routing-strategy design in next-generation MoE models.

Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models

Parkinson's disease (PD) and Alzheimer's disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, and early diagnosis is critical for delaying their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson's Progression Markers Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach. The code will be released on acceptance of this paper.
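For readers unfamiliar with class weighting, a common static baseline is inverse-frequency weighting, sketched below. This is not the dynamically weighted mechanism the abstract refers to, whose details are not given here; the function name and the example labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and common classes below 1,
    so a weighted loss pays more attention to under-represented classes.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced cohort: six patients and two healthy controls.
weights = inverse_frequency_weights(["PD"] * 6 + ["HC"] * 2)
```

Such weights are typically passed to a weighted cross-entropy loss so that errors on the minority class cost more.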

DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis

In this paper, we study the Facility Location Problem with Scarce Resources (FLPSR) under the assumption that agents' types follow a probability distribution. In the FLPSR, the objective is to identify the optimal locations for one or more capacitated facilities to maximize Social Welfare (SW), defined as the sum of the utilities of all agents. Since the total capacity of the facilities is insufficient to serve all agents, they compete in a First-Come-First-Served game to get accommodated. The main contribution of this paper ties Optimal Transport theory to the problem of selecting a truthful mechanism tailored to the agents' distributions. For the case of a single facility, we show that an optimal mechanism always exists. We examine three classes of probability distributions and either characterize the optimal mechanism analytically or provide a routine to compute it numerically. We extend our results to the case of two capacitated facilities. We initially assume that agents are independent and identically distributed, but our techniques generalize to scenarios where agents are not identically distributed. Finally, we validate our findings through several numerical experiments, including: (i) deriving optimal mechanisms for the class of beta distributions, (ii) assessing the Bayesian approximation ratio of these mechanisms for small numbers of agents, and (iii) assessing how quickly the expected SW of the mechanism converges to its limit.

Designing Optimal Mechanisms to Locate Facilities with Insufficient Capacity for Bayesian Agents

Just as a coin has two sides, the impressive performance of large language models (LLMs) also brings inherent toxicity risks, prompting the need for effective detoxification to support responsible deployment. Prevailing methods are generally inflexible and model-specific, addressing only individual models or model families. Moreover, overlooking the toxic risks in the input prefix can lead to toxicity accumulating during autoregressive generation. Existing methods rely on strong external attribute interventions to address this issue, which further exacerbates contextual semantic inconsistencies and makes it difficult to balance detoxification efficacy and generation quality. To address these concerns, we propose a novel Model-Agnostic Adaptive Detoxification (MAAD) framework. To counter accumulating toxicity, we present prefix heuristics that serve as contextual signals, guiding the base LLM toward safer generation. Along this line, we construct an antidote dataset to support a lightweight model, Detoxifier, which steers the base LLM to make in-scope and reliable detoxifying distribution adjustments while preserving fluency and contextual understanding. Designed as an easy-to-deploy module, Detoxifier requires only a small amount of data and can be seamlessly applied to various base LLMs with one-off training. Since over-purifying often reduces diversity, we also propose a dynamic truncation method called CW-cutoff sampling to trade off language model quality and diversity. Extensive experiments demonstrate that MAAD strikes a better balance between detoxification effectiveness and generation quality, while also maintaining model utility.
