Hallucination in Large Vision-Language Models (LVLMs) remains a critical challenge, undermining their reliability in real-world applications. Existing studies have investigated the causes of hallucination at the modality level and proposed effective mitigation strategies; however, interaction patterns beyond the modality level remain insufficiently explored. In this paper, we conduct a token-level analysis and identify two key phenomena: (1) a small subset of textual tokens in LVLMs exerts disproportionate influence in the visual-active layers, surpassing that of the visual modality and potentially misleading visual understanding; (2) while LVLMs can correctly identify key visual information, insufficient focus on these cues can still lead to hallucinations. Based on these observations, we attribute hallucinations in LVLMs to two token-level causes: the disproportionate influence of certain textual tokens (phantom tokens) and the underutilization of critical visual cues (anchor tokens). To mitigate these issues, we introduce Token-Asymmetric Filtering (TAF), a training-free, plug-and-play method that modulates intermediate attention maps in LVLMs. TAF suppresses the influence of phantom tokens and amplifies the influence of anchor tokens in the visual-active layers. Experimental results across multiple benchmarks demonstrate that TAF significantly mitigates hallucinations across a range of state-of-the-art LVLMs. The code will be released.
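The abstract does not specify TAF's exact update rule, but the core idea of asymmetrically reweighting attention toward anchor tokens and away from phantom tokens can be illustrated with a minimal sketch. Everything here is assumed for illustration: the function name `modulate_attention`, the scalar factors `suppress` and `boost`, and the premise that phantom/anchor token indices have already been identified are not given in the source.

```python
import numpy as np

def modulate_attention(attn, phantom_idx, anchor_idx, suppress=0.1, boost=1.5):
    """Illustrative asymmetric reweighting of one attention map.

    attn        : (queries, keys) row-stochastic attention matrix
    phantom_idx : key positions of over-influential textual (phantom) tokens
    anchor_idx  : key positions of under-used critical visual (anchor) tokens
    suppress    : factor < 1 applied to phantom columns (hypothetical)
    boost       : factor > 1 applied to anchor columns (hypothetical)
    """
    attn = attn.copy()
    # Isolate phantom influence by scaling their attention columns down,
    # and emphasize anchor tokens by scaling their columns up.
    attn[:, phantom_idx] *= suppress
    attn[:, anchor_idx] *= boost
    # Renormalize each query row so the weights again sum to 1.
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn

# Toy usage: uniform attention over 4 keys; key 0 is a phantom token,
# key 3 an anchor token. After modulation, key 3 dominates key 0.
A = np.full((2, 4), 0.25)
out = modulate_attention(A, phantom_idx=[0], anchor_idx=[3])
```

In an LVLM this reweighting would be applied only inside the visual-active layers; a real implementation would also need the detection step that locates phantom and anchor tokens, which the abstract leaves unspecified.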
