Vision-Language Models (VLMs) have achieved notable success in tasks such as visual question answering, yet their resilience to distractions in prompts remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs, including general-purpose models (like GPT-4o) and those specialized for reasoning, against both visual and textual distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. Using this benchmark, we evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in the textual explanations that lead to their answers. Our findings show that most VLMs are vulnerable to distractions, with noticeable degradation in reasoning when extraneous content is present. Notably, some models (such as GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. While these strategies modestly improve resilience, our analysis highlights considerable room for further improvement in VLM robustness.
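To make the setup concrete, the sketch below shows one way a textual distraction could be injected into a ScienceQA-style multiple-choice prompt, along with a simple prompt-engineering mitigation that instructs the model to ignore irrelevant content. The function name, the distraction sentence, and the mitigation wording are illustrative assumptions, not the paper's exact benchmark-construction procedure.

```python
# Minimal sketch of injecting a textual distraction into a ScienceQA-style
# multiple-choice prompt. The helper name, the distraction sentence, and the
# mitigation instruction are hypothetical and only illustrate the general idea.

def build_prompt(question: str, choices: list[str], distraction: str | None = None,
                 mitigate: bool = False) -> str:
    """Assemble a text prompt, optionally appending an irrelevant sentence."""
    lines = []
    if mitigate:
        # Hypothetical prompt-engineering mitigation: ask the model to ignore
        # content that does not bear on the question.
        lines.append("Some sentences below may be irrelevant to the question; "
                     "ignore them and answer using only relevant information.")
    lines.append(f"Question: {question}")
    if distraction:
        # Textual distraction injected alongside the question context.
        lines.append(f"Context: {distraction}")
    lines.append("Choices: " + "; ".join(
        f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)))
    lines.append("Answer with the letter of the correct choice and explain your reasoning.")
    return "\n".join(lines)


if __name__ == "__main__":
    q = "Which property do these objects have in common?"
    opts = ["hard", "stretchy", "transparent"]
    noise = "The local football team won their match on Saturday."  # irrelevant sentence
    print(build_prompt(q, opts, distraction=noise, mitigate=True))
```

Comparing model answers and explanations on the clean prompt versus the distracted one (with and without the mitigation instruction) mirrors, at a very small scale, the kind of robustness comparison the benchmark is designed to support.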