Despite the progress made through deep learning, existing Visual Object Tracking (VOT) frameworks struggle with real-world challenges. Recent approaches incorporate additional modalities such as Depth, Thermal Infrared, and Language to enhance the robustness of VOT; in particular, improvements in depth-sensor precision have made RGB-D tracking practical. However, current RGB-D trackers often copy RGB tracking paradigms, which leads to inefficiency: their two-stream architectures fail to exploit heterogeneous features, and they rely on either simplistic or large-parameter fusion methods. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker that leverages Mamba's linear complexity to perform feature extraction and two-stage cross-modal feature fusion simultaneously. We further introduce a low-parameter Multimodal Mix Mamba (3M) module, which improves deep feature fusion while reducing computational overhead. The advantage of the 3M module stems from our Multimodal State Space Model (MSSM), a multimodal feature-interaction component reconstructed from the standard SSM. Experiments across multiple RGB-D tracking datasets show that AMTrack achieves superior performance with fewer parameters and lower memory demands than state-of-the-art trackers. Notably, our method generalizes to multi-modal tracking.
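The abstract does not include implementation details, so the following is only a rough, hypothetical sketch of the general idea of fusing two modalities through a shared state-space recurrence: RGB and depth tokens are interleaved and scanned by the textbook SSM recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t, so the shared hidden state carries information across modalities at linear cost in sequence length. All names here (ToySSMFusion, dim, state) are illustrative and are not taken from the paper; the actual 3M/MSSM design may differ substantially.

    # Hypothetical sketch, not the authors' code: a toy state-space fusion
    # over interleaved RGB and depth token sequences.
    import torch
    import torch.nn as nn

    class ToySSMFusion(nn.Module):
        def __init__(self, dim: int, state: int = 16):
            super().__init__()
            # A, B, C of the discrete SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t
            self.A = nn.Parameter(torch.randn(state, state) * 0.01)
            self.B = nn.Linear(dim, state, bias=False)
            self.C = nn.Linear(state, dim, bias=False)

        def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
            # rgb, depth: (batch, length, dim). Interleave tokens so the scan
            # alternates between modalities and the shared state mixes them.
            x = torch.stack((rgb, depth), dim=2).flatten(1, 2)   # (B, 2L, dim)
            h = x.new_zeros(x.size(0), self.A.size(0))           # (B, state)
            outs = []
            for t in range(x.size(1)):
                h = h @ self.A.T + self.B(x[:, t])               # linear recurrence
                outs.append(self.C(h))
            y = torch.stack(outs, dim=1)                          # (B, 2L, dim)
            # Sum the outputs at RGB and depth positions to get one fused token per location.
            return y[:, 0::2] + y[:, 1::2]                        # (B, L, dim)

    # Usage: fuse 64 RGB tokens with 64 depth tokens of width 256.
    fused = ToySSMFusion(dim=256)(torch.randn(2, 64, 256), torch.randn(2, 64, 256))

Real selective-scan implementations replace the explicit Python loop with a parallel scan and input-dependent A, B, C; the loop above is kept only to make the recurrence readable.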
