Singapore

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
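As a rough illustration of the tree-search machinery the abstract refers to, the Python sketch below shows the textbook UCB1 selection and backup steps of MCTS. It is not the TDSR algorithm itself: the `Node` class, its `caption` field, and the exploration constant `c` are hypothetical, and the scalar passed to `backup` merely stands in for the output of a cheap value estimator used in place of an expensive VLM call.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A search-tree node; `caption` stands in for a partial description (assumption)."""
    caption: str
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)


def ucb1(parent: Node, child: Node, c: float = 1.4) -> float:
    """UCB1 score: mean value so far plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(parent.children, key=lambda child: ucb1(parent, child))


def backup(path: list, value: float) -> None:
    """Propagate a value estimate to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += value
```

In a full search loop, `select_child` would be applied repeatedly from the root to a leaf, the leaf would be expanded and scored, and `backup` would then update the statistics along that path.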

AAAI 2026

Top-Down Semantic Refinement for Image Captioning

top-down reasoning

image caption


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Embodied navigation is a fundamental capability that enables embodied agents to effectively interact with the physical world in various complex environments. 
However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding, which is essential for agents to perceive their surroundings, adapt to intricate layouts, and make informed decisions based on spatial relationships.
To address this gap, we propose a new task of embodied navigation called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents' navigation capabilities, enabling more effective interactions in real-world scenarios.
To support this task, we have generated a spatial navigation dataset consisting of 10,000 trajectories within the AI2THOR simulator, with 5,000 trajectories allocated to each component. This dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource to enhance agent training and performance. By offering diverse scenarios and rich contextual information, this dataset aims to facilitate improved learning and adaptability for embodied agents in complex environments.
Building on the spatial navigation dataset, we introduce SpNav, a hierarchical navigation framework designed to embody the principle of "What You See is What You Reach." SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify target objects or areas within the observation range. It subsequently achieves precise point-to-point navigation using a spatial map, thereby successfully completing the spatial navigation task. This framework enhances the agent's ability to operate effectively in complex environments, bridging the gap between perception and action.
Extensive experiments demonstrate that SpNav not only achieves state-of-the-art performance in spatial navigation tasks, surpassing all baseline methods, but also showcases remarkable zero-shot simulation-to-reality transfer capabilities, highlighting its potential for real-world deployment and practical applications in embodied AI.
To support ongoing research in this field, we will release the dataset, benchmark, and source code, enabling the community to build upon our work and explore new avenues for advancement.

What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. The complete codebase will be made publicly available.

Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

Deep unrolling models (DUMs) have shown great potential in sparse-view CT reconstruction by combining iterative optimization and deep learning. 
However, most DUMs insufficiently account for physical degradation from sparse-view imaging, leading to slow convergence and persistent artifacts.
To address this, we propose PAUM, a Physics-Aware Accelerated Unrolling Model explicitly incorporating CT imaging physics into the iterative reconstruction.
PAUM introduces a Dual-Domain Physics-Aware Extrapolation (DDPE) module.
By modeling dual-domain degradations, it performs row-wise extrapolation in the sinogram domain to improve missing view recovery, and pixel-wise extrapolation in the image domain to address spatially variant degradation from incomplete backprojection.
This physics-aware extrapolation aligns optimization dynamics with underlying physical imaging degradation, significantly accelerating convergence.
Subsequently, we develop a lightweight Block-Attention Deformable Regularization Network (BDRN), leveraging deformable convolutions and block-wise attention to model spatially variant and structured artifact physical characteristics.
This enables spatially adaptive regularization on extrapolated results, effectively improving reconstruction quality.
Extensive experiments demonstrate PAUM achieves over 1 dB PSNR improvement compared to SOTA methods, while reducing iteration count by 50%. Code will be released.
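To put the reported PSNR gain in context, the following Python sketch computes PSNR from the mean squared error using the standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name, the flat-list image representation, and the default `max_value` of 1.0 are assumptions made for illustration; they are not taken from the PAUM code.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    squared_errors = [(r - x) ** 2 for r, x in zip(reference, reconstruction)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a 1 dB gain corresponds to dividing the MSE by 10^0.1 (about 1.26), that is, roughly a 20% reduction in mean squared error.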

Physics-Aware Accelerated Unrolling Model for Sparse-View CT Reconstruction

Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled $k$-$t$ space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. 
In this work, we propose MoCo‑INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion‑compensated (MoCo) framework. Using the explicit motion modeling and the continuous prior of INRs, our MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Moreover, we present a new INR network architecture tailored to the CMR problem, which can greatly stabilize model optimization.
Experiments on retrospective (*i.e.*, simulated) datasets demonstrate the superiority of MoCo‑INR over state‑of‑the‑art methods, achieving fast convergence and fine‑detailed reconstructions at ultra‑high acceleration factors (*e.g.*, 20$\times$ in VISTA sampling).
In addition, evaluations on prospective (*i.e.*, real-acquired) free‑breathing CMR scans highlight its clinical practicality for real‑time imaging. Several ablation studies also confirm the effectiveness of critical components of MoCo-INR. The code will be publicly released to improve reproducibility.

Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation

Video-based human pose estimation aims to localize keypoints across frames, enabling robust analysis of human motion in applications such as sports, surveillance, and healthcare. However, existing methods rely solely on visual cues, limiting their robustness in complex scenes involving occlusion, motion blur, or poor lighting. In contrast, dual coding theory from psychology suggests that human cognition is inherently multimodal: we learn by integrating visual perception with linguistic context to form structured, semantic understandings of the world. Visual input provides concrete spatiotemporal grounding, while language offers symbolic abstraction that enhances reasoning and generalization. Motivated by this cognitive principle, we present the first framework that explicitly incorporates language as an auxiliary modality to enhance video-based pose estimation. To address the lack of paired video-text datasets, we first employ a Multimodal Large Language Model (MLLM) to generate textual descriptions of human interactions from videos. We then propose a novel coarse-to-fine multimodal alignment pipeline: a cross-modal semantic interaction module establishes initial grounding between spatiotemporal visual features and textual embeddings, while an optimal transport-based feature matching mechanism enforces fine-grained, geometry-aware alignment. This cognitively inspired design enables more accurate and robust pose estimation, especially in visually challenging scenes like occlusion and motion blur. Extensive experiments on three benchmarks confirm that our method consistently outperforms state-of-the-art approaches. Our code is released and included in the supplementary materials.
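The abstract does not say which optimal transport solver is used for feature matching. As one common choice, the Python sketch below runs Sinkhorn iterations for entropy-regularized optimal transport between two uniformly weighted feature sets. The function name, the uniform marginals, and the `reg` and `n_iters` parameters are assumptions for illustration only; in practice the cost matrix would come from distances between visual and textual features.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    cost[i][j] is the cost of matching source item i to target item j.
    Returns a transport plan whose rows sum to about 1/n and columns to 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each entry of the returned plan can be read as a soft matching weight: low-cost pairs receive most of the mass, and a smaller `reg` makes the matching sharper.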

Dual Coding Theory in Action: Language-Assisted Human Pose Estimation in Videos

User purchase decisions are driven by complex, multi-faceted intentions that evolve across different temporal horizons (e.g., immediate needs, transitional interests, and long-term preferences). However, existing sequential methods often treat user sequences as unified blocks, overlooking the dynamic evolution of user intents at different granularities, while also lacking robustness against prevalent noise in real-world interaction data. This paper proposes Multi-granularity Intent Modeling with Adversarial Robustness for Sequential Recommendation (MIMAR-SRec), a framework that models latent user intentions at multiple granularities. Specifically, MIMAR-SRec integrates multi-granularity intent representation across different contextual windows to capture evolving user interests, dual-perspective contrastive learning that aligns user representations with both intent prototypes and cross-user sequences, and intent-similarity adversarial robustness that systematically enhances model stability against interaction, temporal, and preference noise through controlled perturbations. By integrating multi-granularity intent modeling with adversarial training, MIMAR-SRec enables simultaneous fine-grained underlying intent modeling and noise-resistant recommendations. Extensive experiments on four widely used benchmark datasets demonstrate that MIMAR-SRec outperforms state-of-the-art baselines, particularly in long-tail item recommendation and noisy interaction scenarios. Our code is available in the appendix and will be open-sourced upon paper acceptance.

Multi-granularity Intent Modeling with Adversarial Robustness for Sequential Recommendation

Multi-modal salient object detection (SOD) shows an improvement over its uni-modal counterpart by exploiting the complementary benefits between modalities. However, this improvement relies on complete multi-modal information, which is difficult to guarantee in practice due to sensor failures and transmission errors. To address this issue, we propose a robust multi-modal SOD framework that enhances the adaptability to modality-missing situations, while maintaining comparable performance in modality-complete cases. Nevertheless, flexibly handling modality-missing and modality-complete situations and integrating their corresponding multi-modal features in a unified framework is non-trivial. To this end, we realize this framework by designing a Cascaded Mixture-of-Experts (CMoE) network that sequentially incorporates missing-aware and multi-modal MoE. Specifically, the missing-aware MoE introduces zero, copy, and alter experts with a soft router to adaptively reconstruct feature representations for both missing and non-missing modalities, assisted by an expert modulation loss that guides the router to modulate the weights of different experts according to missing conditions. The multi-modal MoE introduces two homogeneous uni-modal experts that separately learn modality-specific knowledge tailored for different modalities and dynamically combines their output through the soft router. The cascaded architecture fully empowers CMoE with the flexibility across varying input cases. Extensive experiments on RGB-D and RGB-T SOD datasets, with both modality-missing and modality-complete settings, demonstrate the effectiveness of the proposed method. Code and models will be made publicly available.

Taming Cascaded Mixture-of-Experts for Modality-missing Multi-modal Salient Object Detection

Accurate muscle-mass assessment is crucial for staging and managing sarcopenia, yet existing methods suffer from modality-specific limitations and weak integration of muscle function indicators. To address these limitations, we propose a Dual-source Features Graph for Sarcopenia Evaluation (DFGSE) to synergize high- and low-energy whole-body Dual-energy X-ray Absorptiometry (DXA) images, local high-energy DXA images, and blood-borne biochemical markers. Specifically, the feature extraction module employs dual-energy feature extraction to disentangle soft-tissue and skeletal cues from low-energy images, while skeleton-aware detection extracts joint features from high-energy images. It yields global and local DXA embeddings, complemented by blood-test representations. In the relevance exploration module, inter- and intra-modality correlations are computed via bilinear transformations to form adjacency matrices for the global, local, and blood modality representations. These matrices seed the Multi-type Multi-relation Graph Convolutional Network (MMGCN) – the core of the relation learning module – which captures both direct and indirect interactions among modalities through relation-aware message passing. Finally, the graph-fused representations are used by a muscle-mass prediction head trained with cross-entropy loss. Experiments on the public MURA dataset and two independent sarcopenia cohorts demonstrate that DFGSE consistently outperforms machine learning and state-of-the-art graph-based methods in terms of four evaluation metrics for the classification task.

Sarcopenia Assessment Model Based on Dual-Source Modal Graph

As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent–environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions. We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination. We also describe our methodology for procedural task generation, task validation, and real-time environment control. VirtualEnv is released as an open-source platform. With it, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.

VirtualEnv: A Platform for Embodied AI Research

Federated Deep Reinforcement Learning (FDRL) aims to enable distributed collaborative training of multiple DRL models while preserving privacy. Existing FDRL methods function in static client environments, but real-world scenarios often involve dynamic state transitions, such as noise, which render static model topologies inadequate and result in biased policy loss. This degrades client performance and leads to suboptimal global policies. To address this challenge, we develop a generic solution, referred to as the self-regulating training framework, which can be seamlessly integrated into existing FDRL approaches to address dynamic state transitions. Specifically, we propose a Sparse Training (ST) method that dynamically sparsifies and adjusts the topology of each model during training to maximize model performance and reduce model complexity. Additionally, we introduce an auxiliary model to adaptively regulate the policy loss of client models, mitigating loss bias and facilitating updates that yield improved returns. Experimental results demonstrate that our method enhances six state-of-the-art (SOTA) FDRL approaches across nine tasks in terms of return.
