Egocentric human pose estimation (HPE) plays a crucial role in immersive applications such as virtual and augmented reality. However, existing methods that rely on either visual or sparse inertial data alone often suffer from occlusion or ill-posed estimation. In this work, we propose SAME, a novel spatial-aware multimodal fusion framework that combines complementary signals from stereo images and sparse IMUs for accurate and robust egocentric HPE. It adopts a two-stage network built on a dual coordinate frame to mitigate the coordinate inconsistencies between the stereo cameras and the IMUs. In the first stage, the IMU signals are transformed into the local frame and iteratively fused with the stereo images to estimate 3D poses in the local frame. In the second stage, the local poses are transformed into the global frame using the 6DoF head poses provided by the head-mounted display's (HMD) SLAM algorithm and then temporally aggregated via a temporal Transformer network. Meanwhile, to achieve geometric and semantic alignment among multimodal features, we present a depth-guided spatial-aware deformable stereo attention network and a modality-aware Transformer decoder for cross-view and cross-modal feature fusion. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on the public EMHI multimodal egocentric pose estimation benchmark.
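
To make the two-stage, dual-coordinate-frame idea concrete, the sketch below shows one possible arrangement of the pipeline described in the abstract: stage 1 fuses IMU signals (expressed in the head-centric local frame) with stereo image features and regresses local 3D joints, and stage 2 maps those joints into the global frame with the HMD's 6DoF head pose and refines them with a temporal Transformer. This is not the authors' implementation; all module names, tensor shapes, joint counts, and layer sizes are illustrative assumptions, and the cross-modal block stands in for the paper's deformable stereo attention and modality-aware decoder.

```python
# Illustrative sketch only (not SAME's released code): a minimal two-stage
# local->global egocentric pose pipeline. Shapes and hyperparameters are assumed.
import torch
import torch.nn as nn

NUM_JOINTS = 22          # assumed number of body joints
IMU_DIM = 6 * 12         # assumed: 6 sparse IMUs x (orientation + acceleration) features
FEAT_DIM = 256           # assumed shared feature width


class LocalStageFusion(nn.Module):
    """Stage 1: fuse stereo image features with IMU signals already rotated
    into the local (head-centric) frame and regress per-frame local joints."""

    def __init__(self):
        super().__init__()
        self.imu_proj = nn.Linear(IMU_DIM, FEAT_DIM)
        # Stand-in for the depth-guided deformable stereo attention and the
        # modality-aware Transformer decoder described in the abstract.
        self.cross_modal = nn.TransformerDecoderLayer(
            d_model=FEAT_DIM, nhead=8, batch_first=True)
        self.pose_head = nn.Linear(FEAT_DIM, NUM_JOINTS * 3)

    def forward(self, stereo_feats, imu_local):
        # stereo_feats: (B, tokens, FEAT_DIM) features pooled from both views
        # imu_local:    (B, IMU_DIM) IMU signals in the local frame
        query = self.imu_proj(imu_local).unsqueeze(1)            # (B, 1, FEAT_DIM)
        fused = self.cross_modal(query, stereo_feats)            # cross-modal fusion
        return self.pose_head(fused).reshape(-1, NUM_JOINTS, 3)  # local-frame joints


class GlobalTemporalStage(nn.Module):
    """Stage 2: lift local joints to the global frame with the HMD's 6DoF
    head pose, then aggregate over time with a temporal Transformer."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=NUM_JOINTS * 3, nhead=3, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, local_joints, head_rot, head_trans):
        # local_joints: (B, T, J, 3); head_rot: (B, T, 3, 3); head_trans: (B, T, 3)
        # Rotate each joint by the head orientation, then translate.
        global_joints = torch.einsum("btij,btkj->btki", head_rot, local_joints) \
                        + head_trans.unsqueeze(2)
        b, t, j, _ = global_joints.shape
        refined = self.temporal(global_joints.reshape(b, t, j * 3))
        return refined.reshape(b, t, j, 3)


if __name__ == "__main__":
    stage1, stage2 = LocalStageFusion(), GlobalTemporalStage()
    stereo = torch.randn(2, 64, FEAT_DIM)                    # dummy stereo tokens
    imu = torch.randn(2, IMU_DIM)                            # dummy local-frame IMU input
    local = stage1(stereo, imu)                              # (2, 22, 3)
    head_R = torch.eye(3).expand(2, 8, 3, 3)                 # dummy SLAM head rotations
    head_t = torch.zeros(2, 8, 3)                            # dummy SLAM head translations
    out = stage2(local.unsqueeze(1).expand(-1, 8, -1, -1), head_R, head_t)
    print(out.shape)                                         # torch.Size([2, 8, 22, 3])
```

In this reading, the dual coordinate frame shows up as a clean split of responsibilities: stage 1 never sees global drift because everything is expressed relative to the head, while stage 2 only handles the rigid local-to-global transform and temporal smoothing.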
