Knowledge distillation based on large vision-language models (VLMs) has recently emerged as a significant solution for transferring knowledge from the source domain to the target domain in unsupervised domain adaptation (UDA) tasks. However, existing methods employ a two-stage training pipeline, which not only complicates the training procedure but also lacks interaction between the source and target domains, severely hindering real-time cross-domain knowledge transfer. To address these challenges, we propose \textbf{E}nd-to-End \textbf{K}nowledge \textbf{D}istillation for UD\textbf{A} with large VLMs (termed EKDA). (1) EKDA employs a lightweight prompt learning mechanism to first embed knowledge from the source domain into the VLM, and then simultaneously utilizes the VLM's image and text encoders to perform knowledge distillation on the target domain, significantly reducing the domain gap. (2) EKDA designs a teacher-student alternating training strategy that implements real-time collaborative interaction across domains, enabling an end-to-end paradigm that provides accurate source domain-aware supervision for the target domain. We conduct extensive experiments on 4 widely recognized benchmark datasets: Office-31, Office-Home, VisDA-2017, and Mini-DomainNet. Experimental results demonstrate that EKDA achieves significant performance improvements over state-of-the-art UDA approaches while maintaining much lower model complexity. On Office-Home, for example, EKDA gains at least a 2.7\% performance improvement while reducing the number of learnable parameters by over 80\% compared with the strongest baselines. The code is available at \textcolor{blue}{https://anonymous.4open.science/r/EKDA}.
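The abstract describes a frozen VLM teacher that scores target images against prompt-tuned class-text embeddings, with a student trained to match the teacher's softened predictions. A minimal sketch of that distillation step is shown below; this is not EKDA's actual implementation, and all function names, the logit scale, and the temperature are illustrative assumptions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax, numerically stabilized.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vlm_teacher_logits(img_emb, txt_emb, scale=100.0):
    # CLIP-style teacher: cosine similarity between L2-normalized image
    # embeddings and class-text embeddings (the text embeddings would come
    # from prompt-tuned class prompts in the paper's setting).
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    return scale * img @ txt.T

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Distillation loss: KL(teacher || student) over softened distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Toy usage on unlabeled target-domain features (2 images, 2 classes).
img_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
txt_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
teacher = vlm_teacher_logits(img_emb, txt_emb)
student = np.array([[2.0, 0.5], [0.3, 1.8]])
loss = kd_loss(student, teacher)
```

In the end-to-end scheme the paper sketches, this loss would be minimized on target images in alternation with prompt-learning updates on the source domain, rather than in two separate stages.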