Large vision-language models (LVLMs) have achieved impressive results in vision-language tasks. However, they suffer from a language bias that diminishes their focus on visual inputs, weakening visual comprehension and inducing hallucinations. Therefore, we propose LACING, designed to address this bias with a MuLtimodal DuAl-attention meChanIsm (MDA) aNd Soft-Image Guidance (SIG). Specifically, MDA adopts a parallel dual-attention mechanism that constructs separate attention for visual and text inputs, enhancing the integration of visual inputs across the model. SIG uses a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs during inference. Experiments across different model architectures and scales demonstrate that LACING effectively mitigates the language bias of LVLMs, enhancing visual comprehension and reducing hallucinations without additional resources.
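The abstract only names the two components, so the sketch below is a minimal PyTorch reading of them, not the authors' released code. It assumes a token sequence split by a boolean modality mask, a `model` callable that maps embeddings to next-token logits, and a classifier-free-guidance-style contrast for combining the two logit streams; all class and function names (`DualAttention`, `SoftImageGuidance`, `guided_logits`) and the exact guidance formula are hypothetical.

```python
import torch
import torch.nn as nn


class DualAttention(nn.Module):
    """Sketch of a parallel dual-attention block (MDA-like): visual and
    text tokens form queries through separate attention modules, then the
    two outputs are re-merged into one sequence. Residual connections,
    layer norms, and the FFN are omitted for brevity."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Separate attention paths for the two modalities (assumed split).
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); is_visual: (seq,) boolean modality mask.
        vis, txt = hidden[:, is_visual], hidden[:, ~is_visual]
        # Each modality has its own queries but attends over the full
        # sequence, so visual information still reaches text positions
        # (one plausible design choice, not confirmed by the abstract).
        vis_out, _ = self.visual_attn(vis, hidden, hidden)
        txt_out, _ = self.text_attn(txt, hidden, hidden)
        out = hidden.clone()
        out[:, is_visual], out[:, ~is_visual] = vis_out, txt_out
        return out


class SoftImageGuidance(nn.Module):
    """Sketch of SIG: a learnable soft visual prompt that can stand in
    for real visual features, plus a contrastive decoding step."""

    def __init__(self, num_visual_tokens: int, dim: int):
        super().__init__()
        # Learnable stand-in for the visual tokens.
        self.soft_prompt = nn.Parameter(torch.randn(num_visual_tokens, dim) * 0.02)

    def guided_logits(self, model, text_emb, visual_emb, weight: float = 1.0):
        # Run the model once with real visual features and once with the
        # soft prompt in their place, then contrast the two logit sets.
        batch = text_emb.size(0)
        soft = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        logits_img = model(torch.cat([visual_emb, text_emb], dim=1))
        logits_soft = model(torch.cat([soft, text_emb], dim=1))
        # Classifier-free-guidance-style combination (our assumption):
        # amplify what the real image contributes beyond the soft prompt.
        return (1 + weight) * logits_img - weight * logits_soft
```

With `weight = 0` the guidance step reduces to standard decoding on the real image; larger weights push generation toward tokens supported by the image rather than by the language prior alone, which matches the stated goal of debiasing without extra training resources.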