Vision-Language Models (VLMs) have achieved notable success in tasks such as visual question answering, yet their resilience to distractions in prompts remains underexplored. Understanding how distractions affect VLMs' performance is crucial for real-world applications, as input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs, including general-purpose models (like GPT-4o) and those specialized for reasoning, against both visual and textual distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset, which systematically injects distractions into both visual and textual contexts. Using this benchmark, we evaluate how distractions perturb the underlying reasoning processes of these models by analyzing changes in the textual explanations that lead to their answers. Our findings show that most VLMs are vulnerable to distractions, with noticeable degradation in reasoning when extraneous content is present. Notably, some models (such as GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering. While these strategies modestly improve resilience, our analysis highlights considerable room for further improvement in VLM robustness.
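To make the setup concrete, the sketch below shows one way a textual distraction could be injected into a ScienceQA-style multiple-choice prompt, along with a simple prompt-engineering mitigation that instructs the model to ignore irrelevant content. The function name, the distraction sentence, and the mitigation wording are illustrative assumptions, not the paper's exact benchmark-construction procedure.

```python
# Minimal sketch of injecting a textual distraction into a ScienceQA-style
# multiple-choice prompt. The helper name, the distraction sentence, and the
# mitigation instruction are hypothetical and only illustrate the general idea.

def build_prompt(question: str, choices: list[str], distraction: str | None = None,
                 mitigate: bool = False) -> str:
    """Assemble a text prompt, optionally appending an irrelevant sentence."""
    lines = []
    if mitigate:
        # Hypothetical prompt-engineering mitigation: ask the model to ignore
        # content that does not bear on the question.
        lines.append("Some sentences below may be irrelevant to the question; "
                     "ignore them and answer using only relevant information.")
    lines.append(f"Question: {question}")
    if distraction:
        # Textual distraction injected alongside the question context.
        lines.append(f"Context: {distraction}")
    lines.append("Choices: " + "; ".join(
        f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)))
    lines.append("Answer with the letter of the correct choice and explain your reasoning.")
    return "\n".join(lines)


if __name__ == "__main__":
    q = "Which property do these objects have in common?"
    opts = ["hard", "stretchy", "transparent"]
    noise = "The local football team won their match on Saturday."  # irrelevant sentence
    print(build_prompt(q, opts, distraction=noise, mitigate=True))
```

Comparing model answers and explanations on the clean prompt versus the distracted one (with and without the mitigation instruction) mirrors, at a very small scale, the kind of robustness comparison the benchmark is designed to support.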