Singapore

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-$\alpha$ and DiT demonstrate that ProCache achieves up to 1.96$\times$ and 2.90$\times$ acceleration, respectively, with negligible quality degradation, significantly outperforming prior caching-based methods.
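To make the idea of a non-uniform activation schedule concrete, here is a minimal sketch in plain Python. It is not the ProCache implementation: `run_with_cache`, `max_reuse_interval`, the toy block, and the example schedule are all hypothetical names and values chosen for illustration, and the selective computation over deep blocks and important tokens is omitted.

```python
from typing import Callable, List, Sequence


def run_with_cache(
    num_steps: int,
    activation_steps: Sequence[int],
    compute_block: Callable[[int], float],
) -> List[float]:
    """Return per-step outputs, running the block only on activation steps."""
    active = set(activation_steps)
    if 0 not in active:
        raise ValueError("step 0 must be active so the cache is filled first")
    outputs: List[float] = []
    cached = 0.0
    for t in range(num_steps):
        if t in active:
            cached = compute_block(t)  # full computation, refresh the cache
        outputs.append(cached)  # on other steps the cached feature is reused
    return outputs


def max_reuse_interval(num_steps: int, activation_steps: Sequence[int]) -> int:
    """Largest distance between consecutive activations (or to the last step)."""
    steps = sorted(set(activation_steps)) + [num_steps]
    return max(b - a for a, b in zip(steps, steps[1:]))


calls: List[int] = []


def toy_block(t: int) -> float:
    """Hypothetical stand-in for an expensive transformer block."""
    calls.append(t)
    return float(t * t)


# Hypothetical schedule: dense early, sparse late.
schedule = [0, 1, 2, 4, 7]
outputs = run_with_cache(10, schedule, toy_block)
```

A constrained search, in this simplified picture, would sample many candidate schedules offline and keep only those that satisfy limits such as a cap on `max_reuse_interval`.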

AAAI 2026

ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

diffusion acceleration, visual generation, training-free

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



6-DoF object grasping is a crucial skill for embodied intelligent robots. Previous methods often rely on large-scale networks for feature extraction, followed by grasp pose prediction, which increases the network's parameter count and overlooks the geometric and graph features of the point cloud. To address these challenges, we propose GraphGrasp, a graph-guided 6-DoF grasping pose prediction method. It performs graph analysis from the perspectives of scene, object, and grasping graphs. First, we introduce a graph feature embedding method based on local-global features to model the scene graph effectively. Then, we use a graph transformer strategy to represent spatial relationships between objects in the object graph. Finally, we propose a multi-metric, multi-level grasp pose evaluation algorithm to predict and explore graspable points, enabling effective construction of grasp graphs and accurate grasp pose evaluation. We test GraphGrasp on the GraspNet-1Billion dataset, and the results show that, compared to previous methods, it achieves nearly the same performance with about $\frac{1}{5}$ of the parameters of state-of-the-art methods, significantly improving grasp pose prediction speed. Additionally, in real-world robot grasping scenarios, GraphGrasp outperforms previous methods in practical grasp pose prediction tasks.

GraphGrasp: Lightweight and Efficient Graph-Guided 6-DoF Robotic Grasp Pose Estimation Network

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
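The abstract does not state which selection rule the tailored MCTS uses. As a point of reference only, the sketch below shows the standard UCT (UCB1) selection step. The `Child` class, the exploration constant `c = 1.4`, and the visit statistics are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    name: str
    visits: int
    total_value: float


def uct_score(child: Child, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(children: List[Child], c: float = 1.4) -> Child:
    """Pick the child with the highest UCT score."""
    parent_visits = sum(ch.visits for ch in children)
    return max(children, key=lambda ch: uct_score(ch, parent_visits, c))


# Hypothetical statistics for two candidate refinements of a caption.
children = [Child("a", 10, 6.0), Child("b", 2, 1.0)]
chosen = select_child(children)
greedy = select_child(children, c=0.0)
```

With the exploration term enabled, the rarely visited child `b` is selected. With `c = 0.0` the rule becomes purely greedy and picks `a`, whose mean value is higher.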

Top-Down Semantic Refinement for Image Captioning

Embodied navigation is a fundamental capability that enables embodied agents to effectively interact with the physical world in various complex environments. 
However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding, which is essential for agents to perceive their surroundings, adapt to intricate layouts, and make informed decisions based on spatial relationships.
To address this gap, we propose a new task of embodied navigation called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents' navigation capabilities, enabling more effective interactions in real-world scenarios.
To support this task, we have generated a spatial navigation dataset consisting of 10,000 trajectories within the AI2THOR simulator, with 5,000 trajectories allocated to each component. This dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource to enhance agent training and performance. By offering diverse scenarios and rich contextual information, this dataset aims to facilitate improved learning and adaptability for embodied agents in complex environments.
Building on the spatial navigation dataset, we introduce SpNav, a hierarchical navigation framework designed to embody the principle of "What You See is What You Reach." SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify target objects or areas within the observation range. It subsequently achieves precise point-to-point navigation using a spatial map, thereby successfully completing the spatial navigation task. This framework enhances the agent's ability to operate effectively in complex environments, bridging the gap between perception and action.
Extensive experiments demonstrate that SpNav not only achieves state-of-the-art performance in spatial navigation tasks, surpassing all baseline methods, but also showcases remarkable zero-shot simulation-to-reality transfer capabilities, highlighting its potential for real-world deployment and practical applications in embodied AI.
To support ongoing research in this field, we will release the dataset, benchmark, and source code, enabling the community to build upon our work and explore new avenues for advancement.

What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. The complete codebase will be made publicly available.
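For readers unfamiliar with Point Pair Features, the sketch below computes the classic four-dimensional PPF (distance plus three angles) for two oriented points and checks that it is unchanged by a rigid motion. This illustrates only the invariance property the abstract relies on. The quasi-canonicalization and voting stages are not shown, and the helper names and sample points are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def norm(a: Vec) -> float:
    return math.sqrt(dot(a, a))


def angle(a: Vec, b: Vec) -> float:
    """Angle between two non-zero vectors, in radians."""
    cos = dot(a, b) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def point_pair_feature(p1: Vec, n1: Vec, p2: Vec, n2: Vec):
    """Classic PPF: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = sub(p2, p1)
    return (norm(d), angle(n1, d), angle(n2, d), angle(n1, n2))


def rotate_z(v: Vec, theta: float) -> Vec:
    """Rotate a vector about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])


# Hypothetical oriented points (position, unit normal).
p1, n1 = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
p2, n2 = (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)
before = point_pair_feature(p1, n1, p2, n2)

# Apply a rigid motion: points are rotated and translated, normals only rotated.
theta, t = 0.7, (1.0, -2.0, 3.0)
after = point_pair_feature(
    add(rotate_z(p1, theta), t), rotate_z(n1, theta),
    add(rotate_z(p2, theta), t), rotate_z(n2, theta),
)
```

Because the feature depends only on relative distances and angles, `before` and `after` agree up to floating-point rounding.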

Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

Deep unrolling models (DUMs) have shown great potential in sparse-view CT reconstruction by combining iterative optimization and deep learning. 
However, most DUMs insufficiently account for physical degradation from sparse-view imaging, leading to slow convergence and persistent artifacts.
To address this, we propose PAUM, a Physics-Aware Accelerated Unrolling Model explicitly incorporating CT imaging physics into the iterative reconstruction.
PAUM introduces a Dual-Domain Physics-Aware Extrapolation (DDPE) module.
By modeling dual-domain degradations, it performs row-wise extrapolation in the sinogram domain to improve missing view recovery, and pixel-wise extrapolation in the image domain to address spatially variant degradation from incomplete backprojection.
This physics-aware extrapolation aligns optimization dynamics with underlying physical imaging degradation, significantly accelerating convergence.
Subsequently, we develop a lightweight Block-Attention Deformable Regularization Network (BDRN), leveraging deformable convolutions and block-wise attention to model spatially variant and structured artifact physical characteristics.
This enables spatially adaptive regularization on extrapolated results, effectively improving reconstruction quality.
Extensive experiments demonstrate PAUM achieves over 1 dB PSNR improvement compared to SOTA methods, while reducing iteration count by 50%. Code will be released.
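To put a 1 dB PSNR gain in perspective, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and shows how much the mean squared error must shrink for a given gain. The sample signals and function names are hypothetical and unrelated to the paper's data.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float],
         data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for signals with the given range."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE must be multiplied to gain `gain_db` of PSNR."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical 3-pixel example with values in [0, 1].
reference = [0.0, 0.5, 1.0]
baseline = [0.1, 0.5, 0.9]

ratio = mse_ratio_for_gain(1.0)
# Shrinking every error by sqrt(ratio) shrinks the MSE by `ratio`.
scale = math.sqrt(ratio)
improved = [r + (b - r) * scale for r, b in zip(reference, baseline)]
gain = psnr(reference, improved) - psnr(reference, baseline)
```

Here `ratio` is about 0.794, so a 1 dB improvement corresponds to roughly 20.6% lower MSE.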

Physics-Aware Accelerated Unrolling Model for Sparse-View CT Reconstruction

Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled $k$-$t$ space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. 
In this work, we propose MoCo‑INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion‑compensated (MoCo) framework. Using the explicit motion modeling and the continuous prior of INRs, our MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Moreover, we present a new INR network architecture tailored to the CMR problem, which can greatly stabilize model optimization.
Experiments on retrospective (*i.e.*, simulated) datasets demonstrate the superiority of MoCo‑INR over state‑of‑the‑art methods, achieving fast convergence and fine‑detailed reconstructions at ultra‑high acceleration factors (*e.g.*, 20$\times$ in VISTA sampling).
In addition, evaluations on prospective (*i.e.*, real-acquired) free‑breathing CMR scans highlight its clinical practicality for real‑time imaging. Several ablation studies also confirm the effectiveness of critical components of MoCo-INR. The code will be publicly released to improve reproducibility.

Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation

Video-based human pose estimation aims to localize keypoints across frames, enabling robust analysis of human motion in applications such as sports, surveillance, and healthcare. However, existing methods rely solely on visual cues, limiting their robustness in complex scenes involving occlusion, motion blur, or poor lighting. In contrast, dual coding theory from psychology suggests that human cognition is inherently multimodal: we learn by integrating visual perception with linguistic context to form structured, semantic understandings of the world. Visual input provides concrete spatiotemporal grounding, while language offers symbolic abstraction that enhances reasoning and generalization. Motivated by this cognitive principle, we present the first framework that explicitly incorporates language as an auxiliary modality to enhance video-based pose estimation. To address the lack of paired video-text datasets, we first employ a Multimodal Large Language Model (MLLM) to generate textual descriptions of human interactions from videos. We then propose a novel coarse-to-fine multimodal alignment pipeline: a cross-modal semantic interaction module establishes initial grounding between spatiotemporal visual features and textual embeddings, while an optimal transport-based feature matching mechanism enforces fine-grained, geometry-aware alignment. This cognitively inspired design enables more accurate and robust pose estimation, especially in visually challenging scenes like occlusion and motion blur. Extensive experiments on three benchmarks confirm that our method consistently outperforms state-of-the-art approaches. Our code is released and included in the supplementary materials.
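The abstract does not say how the optimal transport problem is solved. One common choice is entropy-regularized transport via Sinkhorn iterations, sketched below in plain Python with uniform marginals. The cost matrix, `eps`, and the iteration count are hypothetical, and this should not be read as the paper's matching module.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], eps: float = 0.1,
             iters: int = 200) -> List[List[float]]:
    """Entropy-regularized transport plan with uniform row/column marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # mass on each visual feature
    b = [1.0 / m] * m  # mass on each text embedding
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs between 3 visual features (rows) and 2 text tokens (columns),
# e.g. one minus cosine similarity.
cost = [[0.0, 1.0],
        [1.0, 0.0],
        [0.5, 0.5]]
plan = sinkhorn(cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

In the resulting plan, the first visual feature sends almost all of its mass to the first token, the second to the second token, and the ambiguous third feature splits its mass evenly.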

Dual Coding Theory in Action: Language-Assisted Human Pose Estimation in Videos

User purchase decisions are driven by complex, multi-faceted intentions that evolve across different temporal horizons (e.g., immediate needs, transitional interests, and long-term preferences). However, existing sequential methods often treat user sequences as unified blocks, overlooking the dynamic evolution of user intents at different granularities, while also lacking robustness against prevalent noise in real-world interaction data. This paper proposes Multi-granularity Intent Modeling with Adversarial Robustness for Sequential Recommendation (MIMAR-SRec), a framework that models latent user intentions at multiple granularities. Specifically, MIMAR-SRec integrates multi-granularity intent representation across different contextual windows to capture evolving user interests, dual-perspective contrastive learning that aligns user representations with both intent prototypes and cross-user sequences, and intent-similarity adversarial robustness that systematically enhances model stability against interaction, temporal, and preference noise through controlled perturbations. By integrating multi-granularity intent modeling with adversarial training, MIMAR-SRec enables simultaneous fine-grained underlying intent modeling and noise-resistant recommendations. Extensive experiments on four widely used benchmark datasets demonstrate that MIMAR-SRec outperforms state-of-the-art baselines, particularly in long-tail item recommendation and noisy interaction scenarios. Our code is available in the appendix and will be open-sourced upon paper acceptance.

Multi-granularity Intent Modeling with Adversarial Robustness for Sequential Recommendation

Multi-modal salient object detection (SOD) shows an improvement over its uni-modal counterpart by exploiting the complementary benefits between modalities. However, this improvement relies on complete multi-modal information, which is difficult to guarantee in practice due to sensor failures and transmission errors. To address this issue, we propose a robust multi-modal SOD framework that enhances the adaptability to modality-missing situations, while maintaining comparable performance in modality-complete cases. Nevertheless, flexibly handling modality-missing and modality-complete situations and integrating their corresponding multi-modal features in a unified framework is non-trivial. To this end, we achieve this framework by designing a Cascaded Mixture-of-Experts (CMoE) network that sequentially incorporates missing-aware and multi-modal MoE. Specifically, the missing-aware MoE introduces zero, copy, and alter experts with a soft router to adaptively reconstruct feature representations for both missing and non-missing modalities, assisted by an expert modulation loss that guides the router to modulate the weights of different experts according to missing conditions. The multi-modal MoE introduces two homogeneous uni-modal experts that separately learn modality-specific knowledge tailored for different modalities and dynamically combines their output through the soft router. The cascaded architecture fully empowers CMoE with the flexibility across varying input cases. Extensive experiments on RGB-D and RGB-T SOD datasets, with both modality-missing and modality-complete settings, demonstrate the effectiveness of the proposed method. Code and models will be made publicly available.

Taming Cascaded Mixture-of-Experts for Modality-missing Multi-modal Salient Object Detection

Accurate muscle-mass assessment is crucial for staging and managing sarcopenia, yet existing methods suffer from modality-specific limitations and weak integration of muscle function indicators. To address these limitations, we propose a Dual-source Features Graph for Sarcopenia Evaluation (DFGSE) to synergize high- and low-energy whole-body Dual-energy X-ray Absorptiometry (DXA) images, local high-energy DXA images, and blood-borne biochemical markers. Specifically, the feature extraction module employs dual-energy feature extraction to disentangle soft-tissue and skeletal cues from low-energy images, while skeleton-aware detection extracts joint features from high-energy images. It yields global and local DXA embeddings, complemented by blood-test representations. In the relevance exploration module, inter- and intra-modality correlations are computed via bilinear transformations to form adjacency matrices for the global, local, and blood modality representations. These matrices seed the Multi-type Multi-relation Graph Convolutional Network (MMGCN) – the core of the relation learning module – which captures both direct and indirect interactions among modalities through relation-aware message passing. Finally, the graph-fused representations are used by a muscle-mass prediction head trained with cross-entropy loss. Experiments on the public MURA dataset and two independent sarcopenia cohorts demonstrate that DFGSE consistently outperforms machine learning and state-of-the-art graph-based methods on four evaluation metrics for the classification task.

