Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks such as image captioning and visual question answering by integrating visual and textual inputs. However, their robustness to adversarial attacks, particularly attacks that exploit both modalities at once, remains underexplored, posing risks to safety-critical applications such as autonomous driving and content moderation. Existing attacks target a single modality or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce \textit{Multi-Modal Adversarial Synergy (MMAS)}, a framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained Universal Adversarial Perturbation (UAP) for images and a learnable prompt perturbation for text, optimizing the two jointly using only model queries. The image perturbation, bounded in $\ell_{\infty}$-norm, leverages wavelet-based texture constraints to remain imperceptible and robust across diverse visual inputs. The text perturbation, constrained in $\ell_{2}$-norm in the embedding space, preserves semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the gradient directions of the two perturbations, enhancing their synergistic impact and their transferability across tasks and models. Extensive experiments on prevalent LVLMs, spanning a spectrum of tasks and datasets, verify the strong universal adversarial capability of the proposed attack, achieved without any knowledge of the models' internal structure.
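
As a rough illustration of the query-only joint optimization described above, the sketch below updates an image UAP and a prompt-embedding perturbation with zeroth-order gradient estimates and projects them onto $\ell_{\infty}$ and $\ell_{2}$ balls, respectively. It is a minimal sketch under stated assumptions, not the authors' implementation: the names (`attack_loss`, `zo_gradient`, `eps_img`, `eps_txt`), the step sizes, and the query budget are hypothetical, and the wavelet texture constraint and the exact cross-modal gradient-alignment regularizer are only indicated in comments.

```python
import numpy as np

def project_linf(delta, eps):
    """Project the image UAP onto the l_inf ball of radius eps."""
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    """Project the prompt-embedding perturbation onto the l_2 ball of radius eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def zo_gradient(loss_fn, x, sigma=1e-2, n_samples=20):
    """Two-point zeroth-order gradient estimate built from model queries only."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) / (2.0 * sigma) * u
    return grad / n_samples

def mmas_joint_step(img_uap, txt_uap, attack_loss, eps_img, eps_txt, lr=1e-2):
    """One joint black-box update of the image and text perturbations.

    `attack_loss(img_uap, txt_uap)` is assumed to query the victim LVLM and
    return a scalar measuring the distance of its output from the attacker's
    target. The paper's wavelet texture constraint and the cross-modal
    gradient-alignment regularizer would be folded into this loss.
    """
    g_img = zo_gradient(lambda d: attack_loss(d, txt_uap), img_uap)
    g_txt = zo_gradient(lambda d: attack_loss(img_uap, d), txt_uap)

    img_uap = project_linf(img_uap - lr * g_img, eps_img)  # l_inf constraint
    txt_uap = project_l2(txt_uap - lr * g_txt, eps_txt)    # l_2 constraint
    return img_uap, txt_uap
```

In practice this step would be repeated over a batch of images and prompts so that a single perturbation pair transfers across inputs, which is what makes the attack universal rather than per-sample.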
