Video-based human pose estimation aims to localize keypoints across frames, enabling robust analysis of human motion in applications such as sports, surveillance, and healthcare. However, existing methods rely solely on visual cues, limiting their robustness in complex scenes involving occlusion, motion blur, or poor lighting. In contrast, dual coding theory from psychology suggests that human cognition is inherently multimodal: we learn by integrating visual perception with linguistic context to form a structured, semantic understanding of the world. Visual input provides concrete spatiotemporal grounding, while language offers symbolic abstraction that enhances reasoning and generalization. Motivated by this cognitive principle, we present the first framework that explicitly incorporates language as an auxiliary modality for video-based pose estimation. To address the lack of paired video-text datasets, we first employ a Multimodal Large Language Model (MLLM) to generate textual descriptions of human interactions from videos. We then propose a novel coarse-to-fine multimodal alignment pipeline: a cross-modal semantic interaction module establishes initial grounding between spatiotemporal visual features and textual embeddings, while an optimal-transport-based feature matching mechanism enforces fine-grained, geometry-aware alignment. This cognitively inspired design yields more accurate and robust pose estimation, especially under visually challenging conditions such as occlusion and motion blur. Extensive experiments on three benchmarks confirm that our method consistently outperforms state-of-the-art approaches. Our code is released and included in the supplementary materials.
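As a rough illustration of the optimal-transport idea behind the fine-grained matching stage, the sketch below computes an entropy-regularized transport plan between a set of visual features and a set of text-token embeddings via Sinkhorn iterations. This is a generic sketch, not the paper's implementation: the feature dimensions, cosine cost, uniform marginals, and the `sinkhorn` helper are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=50):
    """Entropy-regularized optimal transport (Sinkhorn-Knopp).

    cost: (m, n) pairwise cost between visual and text features.
    Returns a transport plan T with (approximately) uniform marginals.
    """
    m, n = cost.shape
    a = np.full(m, 1.0 / m)   # uniform marginal over visual features (assumption)
    b = np.full(n, 1.0 / n)   # uniform marginal over text tokens (assumption)
    K = np.exp(-cost / reg)   # Gibbs kernel from the cost matrix
    u = np.ones(m)
    for _ in range(n_iters):  # alternating marginal-scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical toy features: 4 spatiotemporal visual tokens, 3 text tokens.
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8))
txt = rng.normal(size=(3, 8))

# Cosine-distance cost: low where visual and textual features agree.
vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
cost = 1.0 - vis_n @ txt_n.T

T = sinkhorn(cost)
# The transport cost <T, cost> could serve as a fine-grained alignment loss.
align_loss = float((T * cost).sum())
```

Minimizing the resulting transport cost encourages a geometry-aware, one-to-many soft correspondence between visual and textual features, which is the general mechanism the abstract attributes to its optimal-transport matching module.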
