Multimodal learning has shown significant advantages on a variety of tasks by integrating multiple modalities. However, the interdependencies among modalities increase the susceptibility of multimodal models to adversarial attacks. Existing methods mainly focus on attacking specific modalities or indiscriminately attack all modalities. In this paper, we find that these approaches ignore the differences between modalities in how much each contributes to final robustness, resulting in suboptimal robustness. To bridge this gap, we introduce \textbf{V}ulnerability-\textbf{A}ware \textbf{R}obust \textbf{M}ultimodal \textbf{A}dversarial \textbf{T}raining (\texttt{VARMAT}), a probe-in-training adversarial training method that improves multimodal robustness by identifying the vulnerability of each modality. Specifically, \texttt{VARMAT} first explicitly quantifies the vulnerability of each modality, grounded in a first-order approximation of the attack objective (Probe). We then propose a targeted regularization term that penalizes highly vulnerable modalities, guiding robust learning while maintaining task accuracy (Training). We demonstrate the enhanced robustness of our method across multiple multimodal datasets involving diverse modalities, achieving robustness improvements of $12.73\%$, $22.21\%$, and $11.19\%$ on three multimodal datasets and revealing a significant blind spot in multimodal adversarial training.
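As a rough illustration of the probe-in-training loop described in the abstract, the hypothetical PyTorch sketch below (our own reading, not the authors' released implementation) scores each modality's vulnerability with a first-order proxy, namely the norm of the loss gradient with respect to that modality's input, and then adds a targeted penalty weighted toward the most vulnerable modalities. The dict-valued `model(inputs)` interface, the coefficient `lam`, and the softmax weighting are all illustrative assumptions, not details confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def probe_vulnerabilities(model, inputs, labels):
    """Probe: estimate each modality's vulnerability with a first-order
    proxy, the norm of the loss gradient w.r.t. that modality's input.
    `inputs` maps modality name -> batched continuous tensor; `model` is
    assumed to accept such a dict and return logits (interface assumptions).
    """
    inputs = {k: v.detach().clone().requires_grad_(True) for k, v in inputs.items()}
    loss = F.cross_entropy(model(inputs), labels)
    # create_graph=True keeps the gradients differentiable, so the
    # vulnerability penalty below can backpropagate into the model.
    grads = torch.autograd.grad(loss, list(inputs.values()), create_graph=True)
    vuln = {k: g.flatten(1).norm(dim=1).mean() for k, g in zip(inputs, grads)}
    return loss, vuln

def vulnerability_aware_objective(model, inputs, labels, lam=0.1):
    """Training: task loss plus a targeted penalty concentrated on the
    modalities the probe flags as most vulnerable (illustrative only)."""
    task_loss, vuln = probe_vulnerabilities(model, inputs, labels)
    scores = torch.stack(list(vuln.values()))
    # Softmax over detached scores focuses the penalty on the most
    # vulnerable modalities without differentiating through the weights.
    weights = torch.softmax(scores.detach(), dim=0)
    return task_loss + lam * (weights * scores).sum()
```

Detaching the softmax weights treats the vulnerability ranking as a per-step constant, so the penalty's gradient only shrinks the probed sensitivities rather than reshuffling their relative weights; whether \texttt{VARMAT} makes the same design choice is not specified in the abstract.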
