Unsupervised domain adaptive pose estimation is a fundamental yet challenging task, as it requires transferring knowledge from labeled synthetic data to unlabeled real data. Nevertheless, the underlying pose semantics, which are governed by spatial structure, remain largely consistent across domains. This observation motivates the use of vision-language models, whose domain-invariant representations align well with high-level semantic concepts. We therefore propose CLIP2Pose, a novel framework that leverages the semantic robustness of frozen CLIP encoders to facilitate cross-domain generalization. We first introduce a semantic-driven prompt mechanism that encodes structural priors, domain-specific appearance, and instance-level context into the image representation, guiding the model to focus on semantically meaningful and structurally relevant features. Next, we propose a semantic modulation module that adaptively refines visual features by conditioning them on prompt-derived embeddings, strengthening the alignment between semantics and visual patterns. To further bridge the modality and domain gaps, we design a directional alignment loss that encourages consistent structural reasoning across both the vision and language representations. Extensive experiments on domain adaptive human body and hand pose benchmarks show that CLIP2Pose achieves state-of-the-art performance.
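
The abstract does not give implementation details, so the sketch below is only one plausible reading of the two central components: a semantic modulation module that conditions visual features on prompt-derived embeddings, and a directional alignment loss that matches the source-to-target shift in the image embedding space with the corresponding shift in the text embedding space. The class and function names, the FiLM-style scale-and-shift parameterization, and the assumption that all inputs are CLIP-style embeddings are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticModulation(nn.Module):
    """Hypothetical FiLM-style conditioning of visual features on a prompt embedding.

    The prompt-derived embedding is projected to per-channel scale and shift
    parameters that modulate a CLIP visual feature map.
    """

    def __init__(self, prompt_dim: int, visual_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(prompt_dim, visual_dim)
        self.to_beta = nn.Linear(prompt_dim, visual_dim)

    def forward(self, visual_feat: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W); prompt_emb: (B, D)
        gamma = self.to_gamma(prompt_emb)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(prompt_emb)[:, :, None, None]    # (B, C, 1, 1)
        return (1.0 + gamma) * visual_feat + beta


def directional_alignment_loss(img_src: torch.Tensor,
                               img_tgt: torch.Tensor,
                               txt_src: torch.Tensor,
                               txt_tgt: torch.Tensor) -> torch.Tensor:
    """One possible directional loss: align the source-to-target direction in
    image embedding space with the source-to-target direction in text space.

    All inputs are (B, D) embeddings; the loss is 1 minus the cosine similarity
    of the two normalized direction vectors, averaged over the batch.
    """
    d_img = F.normalize(img_tgt - img_src, dim=-1)
    d_txt = F.normalize(txt_tgt - txt_src, dim=-1)
    return (1.0 - (d_img * d_txt).sum(dim=-1)).mean()
```

Both pieces follow patterns that are common in CLIP-guided adaptation work (feature-wise modulation and directional CLIP losses); the paper's actual formulation may differ in how prompts are constructed and where the modulation is inserted.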
