Singapore

Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning.
Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks.
Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive.
However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server.
To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. 
To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations.
In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity.
Our method is training-free, not relying on additional training resources on either the client or server side.
Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

AAAI 2026

TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

modal fusion

federated learning

vision-language model

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Dynamic graphs are common in real‑world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the full complexity of temporal evolution. 
They tend to overlook fine‑grained variations in interaction order, struggle with dependencies that span long time horizons, and provide limited modeling of pair‑specific relational dynamics. To address those challenges, we propose Graph2Video, a video‑inspired framework that views the temporal neighborhood of a target link as a sequence of “graph frames”. By stacking temporally ordered subgraph frames into a “graph video”, Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight, plug-and-play, link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state‑of‑the‑art baselines in the link prediction task on most cases. The results highlight that borrowing spatio‑temporal modeling techniques from computer vision provides a principled and effective avenue for advancing dynamic graph learning.

Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems.
In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS’s resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent’s prompt based on its connections and interactions with other agents. 
Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpus with multi-dimensional annotation tailored for speech understanding and generation. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, speech quality scores, among other annotations. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.

WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation

For artificial intelligence to be safely deployed in high-risk domains, it must reliably know its limits. Selective predic- tion, or learning with a reject option, addresses this by en- abling a model to abstain from prediction on inputs it deems unreliable, deferring them to a human expert. While deep en- sembles have emerged as a leading approach for uncertainty estimation, their potential is often squandered by rejection methods that rely on static thresholds applied to the mean prediction. In this paper, we propose to learn a dynamic rejec- tion policy directly from the rich behavioral signals of the en- semble itself. Our framework, DEGRE (Dynamic Ensembles Gating for REjection), is a novel meta-learning approach that trains a lightweight gating network on the ensemble’s con- sensus confidence and its internal disagreement (variance)— to explicitly discriminate between correct and incorrect pre- dictions. Through rigorous evaluation across twelve diverse medical imaging benchmarks (MRI, X-ray, CT), DEGRE sig- nificantly advances selective prediction, achieving an aver- age risk-coverage (AURC) reduction of 68.2% compared to the standard ensemble baseline. By providing a more reli- able method for a model to recognize its own limitations, this learned, adaptive rejection mechanism provides the ro- bust self-awareness necessary for true AI-in-the-loop (AI2L) systems, paving the way for the safe and responsible integra- tion of AI into critical clinical workflows.

DEGRE: Dynamic Gating Ensembles for Trust-Aware Rejection in Medical Image Diagnostics

3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.

Edge-Centric Relational Reasoning for 3D Scene Graph Prediction

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present \textbf{MHA2MLA-VLM}, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. 
Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.

MHA2MLA-VLM: Enabling DeepSeek’s Economical Multi-Head Latent Attention Across Vision-Language Models

Marked Temporal Point Processes (MTPPs) provide a principled framework for modeling asynchronous event sequences by conditioning on the history of past events. However, most existing MTPP models rely on channel-mixing strategies that encode information from different event types into a single, fixed-size latent representation. This entanglement can obscure type-specific dynamics, leading to performance degradation and increased risk of overfitting. In this work, we introduce ITPP, a novel channel-independent architecture for MTPP modeling that decouples event type information using an encoder-decoder framework with an ODE-based backbone. Central to ITPP is a type-aware inverted self-attention mechanism, designed to explicitly model inter-channel correlations among heterogeneous event types. This architecture enhances effectiveness and robustness while reducing overfitting. Comprehensive experiments on multiple real-world and synthetic datasets demonstrate that ITPP consistently outperforms state-of-the-art MTPP models in both predictive accuracy and generalization.

ITPP: Learning Disentangled Event Dynamics in Marked Temporal Point Processes

3D semantic occupancy prediction offers a nuanced representation of the surrounding environment, which is crucial for ensuring the safety of autonomous driving. However, fine-grained scene representations inevitably result in cubic growth in data scale, which imposes substantial demands on model architecture and computational complexity, especially in high-resolution scenarios. Existing approaches for handling high-resolution scenes typically obtain fine-grained features by grid sampling on low-resolution feature map, resulting in limited sparsity and insufficient feature interaction. This paper presents a framework leveraging SParse representation and SCalable feature interaction to address the aforementioned challenges, called SPSC. Specifically, we maintain sparsity by progressively pruning unoccupied queries during the coarse-to-fine process, thereby reducing the scale of data that the model needs to handle. Subsequently, we introduce query serialization, which transforms queries into an ordered sequence while preserving their spatial structure, This enables fine-grained feature interaction while maintaining linear computational complexity and a larger receptive field. Without complex architectural designs, SPSC significantly outperforms SOTA approaches, relatively enhances the mIoU by 12.0%, 11.0% and 4.8% on nuScenes-Occupancy dataset under the muli-modal, LiDAR and camera settings, respectively.

SPSC: Sparse and Scalable Multi-Modal 3D Occupancy Prediction for Autonomous Driving

Existing gaze estimation models often struggle to generalize to unseen users, primarily due to significant variations in individual appearance. Empirical observations reveal that performance improves when the visual appearance of test subjects closely resembles that of training subjects. Motivated by this, we propose a generalizable gaze estimation framework MoEGaze based on the Mixture of Experts (MoE) architecture. During training, the model extracts appearance features from facial images and uses them to route samples to specialized gaze expert networks, each tailored to a specific subset of appearances. Rather than directly predicting gaze, each expert outputs intermediate gaze features, which are dynamically aggregated according to the input appearance and then mapped to gaze prediction. This dynamic routing design enables the model to effectively adapt to users with diverse appearances, while also facilitating easier training on sub-datasets with smaller appearance variations. Extensive experiments demonstrate that our method achieves superior cross-domain performance compared to existing approaches, with an average improvement of 27.6% across four cross-domain
metrics over the baseline. Furthermore, MoEGaze surpasses baselines trained on the full dataset while requiring only 10% of the training data.

MoEGaze: A Mixture of Experts Approach for Generalizable Gaze Estimation

Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training data set.
Extensive offline experiments show that the embedding model trained on our synthesized data outperforms those using data generated based on prompting or vanilla supervised fine-tuning(SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search enhanced recommendation pipeline of a major Chinese short video platform's dual-column scenario, through online A/B testing, the proposed model increased click-through rate(CTR) by 1.45\%, raised the proportion of Strong Relevance Ratio (SRR) by 4.9\%, and improved the Image User Penetration Rate (IUPR) by 0.1054\%.

Downloads

Next from AAAI 2026

Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads