In this work, we introduce a novel method for generating high-fidelity full-head 3D avatars from a single image, regardless of perspective, style, expression, or accessories. Prior works often fail to preserve consistent head geometry and facial details, primarily because of their limited capacity to model fine-grained facial textures and maintain identity information. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames and featuring high-quality facial texture details. To further improve performance, we propose a novel multi-view diffusion model, named ID-TS diffusion, which integrates identity and expression information into a two-stage multi-view diffusion process. The low-resolution stage ensures structural consistency of heads across views, while the high-resolution stage preserves the fidelity and coherence of facial details. Finally, we propose an enhanced feed-forward Gaussian avatar reconstruction method that optimizes the network on multi-view images of each individual subject, significantly improving 3D facial texture details. Extensive experiments show that our method performs robustly in challenging scenarios while remaining broadly applicable to numerous downstream tasks.
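The pipeline described above can be summarized as a minimal structural sketch. All function names, shapes, and the Gaussian count below are illustrative placeholders and do not reflect the authors' actual models or API; only the stage ordering (low-resolution multi-view generation, high-resolution refinement, Gaussian reconstruction) and the 96-view count come from the abstract.

```python
# Hypothetical sketch of the described pipeline: a two-stage multi-view
# diffusion (ID-TS diffusion) followed by feed-forward Gaussian avatar
# reconstruction. NumPy stand-ins replace the real networks.
import numpy as np

NUM_VIEWS = 96  # number of camera perspectives in the proposed dataset


def id_ts_low_res(image, identity, expression, num_views=NUM_VIEWS):
    """Stage 1 (placeholder): generate structurally consistent
    low-resolution views, conditioned on identity and expression."""
    rng = np.random.default_rng(0)
    return rng.random((num_views, 64, 64, 3))


def id_ts_high_res(low_res_views):
    """Stage 2 (placeholder): refine each view to high resolution;
    nearest-neighbor upsampling stands in for the diffusion stage."""
    return low_res_views.repeat(8, axis=1).repeat(8, axis=2)


def reconstruct_gaussians(views):
    """Placeholder for feed-forward Gaussian reconstruction with
    per-subject optimization on the generated multi-view images."""
    n_gaussians = 10_000  # arbitrary illustrative count
    return {"means": np.zeros((n_gaussians, 3)),
            "colors": np.zeros((n_gaussians, 3))}


views_lo = id_ts_low_res(image=None, identity=None, expression=None)
views_hi = id_ts_high_res(views_lo)
avatar = reconstruct_gaussians(views_hi)
```

The sketch only fixes the data flow between stages; in the paper each placeholder is a learned model, and the reconstruction network is further optimized on the multi-view images of each subject.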