While adapting pretrained vision models to downstream dense prediction tasks is widely practiced, current methods often overlook adaptation efficiency, especially in the context of multi-task learning (MTL). Although parameter-efficient fine-tuning (PEFT) methods improve parameter efficiency, broader aspects such as GPU memory and training time remain underexplored. In this paper, we propose a new paradigm that simultaneously achieves efficiency in parameters, GPU memory, and training time for multi-task dense vision adaptation. Specifically, we propose a dual-branch framework in which a frozen pretrained backbone serves as the generic main branch, and the proposed Bi-Directional Task Adaptation (BDTA) modules are integrated in parallel to form a task bypass branch that extracts the adaptation features required by multiple specific tasks. This adaptation module is lightweight and efficient, and it does not require backpropagation through the large pretrained backbone, thus avoiding resource-intensive gradient computations. Moreover, a Mixture-of-Task-Experts (MoTE) mechanism is further proposed to integrate adaptation features across tasks and scales, thereby obtaining more robust representations tailored to dense prediction tasks. On the PASCAL-Context benchmark, our method achieves over 2× relative performance improvement over the best prior multi-task PEFT method, while using only $\sim$30% of the parameters, $\sim$50% of the memory, and $\sim$60% of the training time, demonstrating superior overall adaptation efficiency.
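The abstract does not give the MoTE formulation, but the description ("integrate adaptation features across tasks and scales") suggests a gated weighted combination of per-expert features. The sketch below is a minimal, framework-free illustration of that idea under the assumption that MoTE reduces to a softmax-gated mixture; all function names (`mixture_of_task_experts`, `softmax`) and the gating form are hypothetical, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_task_experts(expert_feats, gate_logits):
    """Fuse per-expert (e.g. per-scale) feature vectors for one task.

    Assumed form of MoTE (not specified in the abstract): each expert
    produces a feature vector, a learned gate scores the experts, and
    the fused representation is the softmax-weighted sum.

    expert_feats: list of equal-length feature vectors, one per expert
    gate_logits:  unnormalized gate scores, one per expert
    """
    weights = softmax(gate_logits)
    dim = len(expert_feats[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, expert_feats):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused, weights

# Usage: fuse two 2-D expert features with gate logits favoring expert 1.
fused, weights = mixture_of_task_experts([[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
```

In practice the gate logits would be predicted per task (and possibly per spatial location) by a small learned network, keeping the mechanism lightweight relative to the frozen backbone.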