Gait recognition has emerged as a promising biometric technique for long-distance, non-intrusive human identification. While Transformers have revolutionized vision tasks, their adaptation to gait recognition remains underexplored due to domain-specific challenges such as the sparse silhouette modality, spatial-temporal dynamics, fine-grained motion cues, and limited training data. In this paper, we propose Gait Transformer (GaT), an end-to-end Transformer backbone specifically tailored for silhouette-based gait recognition. GaT introduces three key components: (1) a hybrid patch embedding module that combines convolutional stems with group-batch normalization to enhance structural preservation; (2) a decomposed token mixer that explicitly models both short-range and long-range dependencies across spatial-temporal dimensions; and (3) a hybrid positional encoding strategy that integrates absolute, relative, and rotary embeddings to support efficient training under data scarcity. Without relying on any pretraining, GaT achieves state-of-the-art performance on Gait3D, GREW, and CCGR-MINI.
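To make the rotary part of component (3) concrete, the sketch below illustrates standard rotary position embedding (RoPE) applied to a token sequence; it is a minimal NumPy illustration of the general technique, not the paper's exact implementation, and the function name and channel-pairing scheme are assumptions:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding to a (seq_len, dim) token array.

    Channels are split into two halves; each (x1_i, x2_i) pair is rotated
    by a position-dependent angle, so token norms are preserved and the
    dot product between two rotated tokens depends only on their relative
    positions. `dim` must be even.
    """
    seq_len, dim = x.shape
    half = dim // 2
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    freqs = base ** (-np.arange(half) / half)  # (half,) per-pair frequencies
    angles = pos * freqs                       # (seq_len, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each channel pair by its angle
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```

Because the rotation is norm-preserving and encodes position in the phase, RoPE injects relative-position information directly into attention scores without adding learned parameters, which is one reason it is attractive under the data scarcity the abstract mentions.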