Singapore

Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making with standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking the alignment with downstream planning and control, or fall short in leveraging the full capacity of the recent emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate our approach achieves state-of-the-art (SOTA) performance with a 39.7\% improvement in perception accuracy, a 7.2\% reduction in prediction error, and a 33.2\% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.

AAAI 2026

UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

cv: vision for robotics & autonomous driving

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. 
Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. 
To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step.
POSE employs a carefully designed two-phase process to distill video models:
(i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories.
(ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos.
For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. 
Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by average 7.15\% in semantic alignment, temporal conference and frame quality, reducing the latency of the pre-trained model by 100$\times$, from 1000 seconds to 10 seconds, while maintaining competitive performance.

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

The automation of diagram generation has gained significant attention in recent years.In previous work on automated diagram generation, most studies focused on generating diagrams from natural language descriptions. Many of these methods produced diagrams that were not user-friendly for editing, lacking drag-and-drop functionality. This paper proposes a novel task of generating editable, high-fidelity diagrams from either text descriptions or raster images (non-editable images), and it may be the first to introduce the tasks of diagram restoration and style transfer in this context. To address these tasks, we created the Diagram-mxGraph dataset, covering three main tasks: diagram restoration, text-to-diagram generation, and diagram style transfer. We introduced two key innovations: Fine-grained Adaptive Background Suppression (FABS) and Component-Aware Adaptive Loss (CAAL). By leveraging the capabilities of pre-trained Vision Transformers (ViTs) and the Diagram Adapter module, we effectively align the features of diagrams with a Large Language Model (LLM) to generate mxGraph format textual descriptions, ultimately producing editable diagrams.

DiagramGPT-Llama3:Enabling Editable, High-Fidelity Diagram Generation with Vision Large Language Models

Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.

Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval

The rise of Vision Transformers (ViTs) as cornerstone models in safety-critical applications like autonomous driving and medical diagnosis has shifted the focus from pure accuracy to verifiable trustworthiness. However, the very mechanisms used to explain these models, their internal attention maps, are themselves vulnerable. This creates a critical "trust gap," as the model's apparent reasoning can be maliciously manipulated. To systematically investigate this vulnerability, we introduce A-SAGE (Attention-based Steering Adversarial Generation by Corrupting Explanations), a dual-objective attack framework that forces a model to misclassify an input while simultaneously corrupting its internal attention patterns to generate a misleading explanation. A-SAGE achieves this by optimizing a unified loss that combines a standard classification objective with two explanation-specific terms: an attention entropy loss to diffuse the model's focus and an \textit{attention map distortion loss} to steer the corrupted explanation towards a desired target. Our primary finding is A-SAGE's exceptional black-box transferability. Using a CaiT-S as a white-box surrogate, adversarial examples generated with imperceptible perturbations ($L_{\infty}\leq16/255$) achieve attack success rates of 79.4\% on ViT-B, 49.7\% on ResNet-50, and over 81.5\% on other transformers (DeiT-B,TNT-S). Crucially, these successful attacks do not merely destroy the explanation; they generate a coherent but false attention map that deceptively "justifies" the wrong prediction. These results reveal a systemic vulnerability in the core reasoning of modern foundation models, establishing A-SAGE as a critical benchmark for auditing the robustness of AI explainability.

Manipulating the Mind’s Eye: A-SAGE, the Attention-Based Attack on ViT Explainability

Federated recommendations (FRs), facilitating multiple local clients to collectively learn a global model without disclosing user private data, have emerged as a prevalent on-device service. In conventional FRs, a dominant paradigm is to utilize discrete identities to represent clients and items, which are then mapped to domain-specific embeddings to participate in model training. Despite considerable performance, we reveal three inherent limitations that can not be ignored in federated settings, i.e., non-transferability across domains, ineffectiveness in cold-start settings, and potential privacy violations during federated training. To this end, we propose a transferable federated recommendation model, TransFR, which delicately incorporates the general capabilities empowered by pre-trained models and the personalized abilities by fine-tuning local private data. Specifically, it first learns domain-agnostic representations of items by exploiting pre-trained models with public textual corpora. To tailor for FR tasks, we further introduce efficient federated adapter-tuning and test-time adaptation mechanisms, which facilitate personalized local adapters for each client by fitting their private data distributions. We theoretically prove the advantages of incorporating adapter tuning in FRs regarding both effectiveness and privacy. Through extensive experiments, we show that our TransFR model surpasses several state-of-the-art FRs on transferability. Our source code is available at attached Supplementary Material.

TransFR: Transferable Federated Recommendation with Adapter Tuning on Pre-trained Language Models

Leveraging intrinsic data priors is critical for effective data recovery. However, existing approaches often struggle to simultaneously achieve theoretical guarantees, strong performance, and computational efficiency. In this paper, we introduce a novel \emph{Representative Coefficient Correlated Total Variation} (RCCTV) regularizer that captures the recently observed low-rank and local smoothness properties of the representative coefficient tensor derived from a low-rank decomposition. RCCTV offers three key advantages: (1) it operates on a compact representative coefficient image significantly smaller than the original data, enabling highly efficient optimization; (2) it jointly enforces low-rankness and spatial smoothness through a single regularizer, eliminating the need for trade-off parameters; and (3) when integrated into a robust PCA framework (RCCTV-RPCA), it admits provable exact recovery under mild conditions. To solve the resulting model, we develop an efficient ADMM-based algorithm accelerated via fast Fourier transform. Extensive experiments on both synthetic and real-world datasets demonstrate that RCCTV-RPCA achieves state-of-the-art accuracy with significantly reduced runtime.

Fast Guaranteed Robust Local-Smooth Principal Component Separation

Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes—such as age, gender, weight, and height—which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate our model's effectiveness.

Generating Attribute-Aware Human Motions from Textual Prompt

Text-to-MIDI generation offers editable and hierarchical control over symbolic music generation. Previous approaches either convert text into a limited set of musical attributes and generate music based on these attributes, which limits semantic controllability, or use end-to-end models that map text directly to music without deeply aligning the features of both modalities, often resulting in a lack of structural coherence and mismatches in key, meter, and tempo. We propose MIDILM, which addresses these limitations by employing text conditioning with a dual-path decoder that processes textual and musical information through separate feedforward paths following a shared masked self-attention mechanism. On the MidiCaps benchmark, MIDILM outperformed the strongest baseline, with relative improvements ranging from 6.07\% on CLAP to 144.77\% on TB across semantic alignment and structural metrics. These gains confirm its ability to enhance both semantic controllability and structural coherence. Collectively, we expect that MIDILM will serve as a useful reference framework for future investigations into controllable and structurally faithful cross-modal music generation.

MIDILM: A Dual-Path Model for Controllable Text-to-MIDI Generation

Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose the first method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by $4.1$ and improves SSIM by up to $0.116$ on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset. Our code is available at: https://anonymous.4open.science/r/SLDDM.

Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR’s single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with O((ST)^2) or O(ST) complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to O(S + T) and memory complexity to O(max(S, T)), enabling global context modeling at 512^2 resolution and beyond, operating directly on dense feature maps with a \textbf{patch-free} design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for 512^2 video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18\%, paving the way for safer and more anticipatory embodied agents.

Downloads

Next from AAAI 2026

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads