Although large vision-language models (LVLMs) have demonstrated promising and versatile capabilities on various downstream tasks, they have been shown to be susceptible to adversarial examples. Existing LVLM attacks operate under impractical settings: i) they add digital, global perturbations to the entire input image; ii) they require prior knowledge of the target LVLM for optimization; iii) they do not account for realistic transformations. These assumptions make such attacks difficult to deploy in physical-world scenarios. Motivated by this gap between research practice and real-world conditions, this paper proposes the first practical LVLM attack method, based on a novel adversarial patch design, that works in both physical and digital attack settings without using any LVLM internals. In particular, we introduce adversarial homogeneous constraints in both the spatial and spectral domains to improve the patch's stealthiness against potential real-world defenses. We also develop a new technique for synthesizing reasonably realistic transformations that capture the patch appearance variations expected in daily life. Extensive experiments verify the strong adversarial capability of our proposed attack against prevalent LVLMs across a spectrum of tasks.
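The abstract does not specify how the spatial and spectral homogeneity constraints are formulated, so the following is only a minimal illustrative sketch of one plausible reading: a total-variation-style penalty for spatial smoothness combined with a penalty on high-frequency Fourier energy for spectral homogeneity. The function names, masking scheme, and weights (`lambda_s`, `lambda_f`) are all hypothetical, not the paper's actual design.

```python
# Hypothetical sketch of spatial/spectral homogeneity penalties for patch
# stealthiness; the paper's exact constraints are not given in the abstract.
import torch


def spatial_homogeneity(patch: torch.Tensor) -> torch.Tensor:
    """Total-variation-style penalty encouraging locally smooth patch colors.

    patch: (C, H, W) tensor with values in [0, 1].
    """
    dh = (patch[:, 1:, :] - patch[:, :-1, :]).abs().mean()
    dw = (patch[:, :, 1:] - patch[:, :, :-1]).abs().mean()
    return dh + dw


def spectral_homogeneity(patch: torch.Tensor) -> torch.Tensor:
    """Penalize high-frequency energy in the patch's 2D Fourier spectrum."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(patch), dim=(-2, -1)).abs()
    _, h, w = patch.shape
    # Zero out a low-frequency square at the spectrum center (assumed cutoff);
    # penalize all remaining (high-frequency) coefficients.
    mask = torch.ones(h, w)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 0.0
    return (spectrum * mask).mean()


def stealth_loss(patch: torch.Tensor, lambda_s: float = 1e-2,
                 lambda_f: float = 1e-3) -> torch.Tensor:
    """Combined regularizer added to the adversarial objective (weights assumed)."""
    return lambda_s * spatial_homogeneity(patch) + lambda_f * spectral_homogeneity(patch)
```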
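Likewise, the transformation-synthesis technique is only named, not described. As a point of reference, an Expectation-over-Transformation-style sampling loop such as the sketch below is the standard way physical-patch attacks model appearance variation (viewing angle, distance, lighting); the specific transformations and parameter ranges here are illustrative assumptions, not the paper's method.

```python
# EOT-style sampling of realistic patch appearance variations. All ranges
# below are illustrative; the paper's synthesis technique is more elaborate.
import random

import torch
import torchvision.transforms.functional as TF


def sample_transformed_patch(patch: torch.Tensor) -> torch.Tensor:
    """Apply one randomly sampled physical-world-style transformation.

    patch: (C, H, W) tensor with values in [0, 1].
    """
    angle = random.uniform(-20.0, 20.0)      # camera / patch rotation
    scale = random.uniform(0.7, 1.3)         # viewing distance
    brightness = random.uniform(0.8, 1.2)    # lighting changes
    out = TF.rotate(patch, angle)
    out = TF.affine(out, angle=0.0, translate=[0, 0], scale=scale, shear=[0.0])
    out = TF.adjust_brightness(out, brightness)
    return out.clamp(0.0, 1.0)
```

During optimization, the adversarial loss would be averaged over many such samples so the patch remains effective under the transformations it will encounter when printed and photographed.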