AAAI 2026

January 22, 2026

Singapore, Singapore


Large vision-language models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data such as images and text. However, the number of visual tokens in these models often far exceeds the number of textual tokens, resulting in substantial redundancy and high inference cost. Existing pruning methods rely primarily on either unimodal information or cross-modal attention: the former overlooks the semantic alignment between instructions and visual representations in the multimodal space, while the latter is prone to attention drift and dispersion, causing significant performance degradation at high pruning ratios. Both issues stem from the lack of effective textual guidance during pruning. To identify informational cues that can guide pruning, we analyze the interaction between language instructions and visual features using cross-modal information bottleneck attribution (CIBA), revealing the presence of noun anchors. Based on this analysis, we propose Instruction-Guided Cross-Modal Clustering Token Pruning (ICCTP), a plug-and-play, training-free pruning paradigm. Specifically, ICCTP first leverages global attention to retain a small set of visual tokens that preserve global context. It then extracts nouns from the instruction as clustering centers and performs cross-modal clustering over the remaining visual tokens. To balance semantic diversity and global relevance while reducing intra-cluster redundancy, we design an importance scoring mechanism. Finally, visual tokens within each cluster are pruned according to a specified pruning ratio. We evaluate ICCTP on multiple VLM architectures, including LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-7B. Experimental results show that ICCTP maintains strong performance across a range of pruning rates without retraining. Notably, even under an extreme setting where 94.4% of visual tokens are removed, ICCTP retains 90.02% of the original accuracy while reducing TFLOPs by 82.36%.
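To make the four-stage pipeline in the abstract concrete, here is a minimal PyTorch sketch of the token-pruning flow as described: retain globally salient tokens, cluster the rest around noun anchors, score, and prune per cluster. This is not the authors' implementation; the tensor shapes, the 0.5/0.5 scoring weights, the `keep_global` and `keep_ratio` values, and the assumption that noun token indices are supplied externally (e.g., from a POS tagger) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def icctp_prune(visual_tokens, text_tokens, noun_token_idx,
                attn_to_visual, keep_global=8, keep_ratio=0.056):
    """Hedged sketch of ICCTP-style pruning (shapes and weights assumed).

    visual_tokens:  (N, d) visual token embeddings
    text_tokens:    (T, d) instruction token embeddings
    noun_token_idx: indices of noun tokens in the instruction
    attn_to_visual: (N,)   aggregated global attention per visual token
    """
    N, _ = visual_tokens.shape

    # Step 1: keep a small set of globally salient tokens via global attention.
    global_idx = attn_to_visual.topk(keep_global).indices
    kept_set = set(global_idx.tolist())
    remaining = torch.tensor([i for i in range(N) if i not in kept_set])

    # Step 2: use noun embeddings from the instruction as clustering centers;
    # assign each remaining visual token to its most similar noun anchor.
    anchors = text_tokens[noun_token_idx]                 # (K, d)
    v = F.normalize(visual_tokens[remaining], dim=-1)
    a = F.normalize(anchors, dim=-1)
    sim = v @ a.T                                         # (M, K) cosine similarity
    cluster = sim.argmax(dim=-1)

    # Step 3: importance score balancing relevance to the assigned anchor
    # against global attention; the equal mixing weights are placeholders,
    # not values from the paper.
    score = 0.5 * sim.max(dim=-1).values + 0.5 * attn_to_visual[remaining]

    # Step 4: within each cluster, keep the top-scoring tokens per the ratio.
    kept = [global_idx]
    for k in range(anchors.shape[0]):
        members = (cluster == k).nonzero(as_tuple=True)[0]
        if len(members) == 0:
            continue
        n_keep = max(1, int(len(members) * keep_ratio))
        top = score[members].topk(n_keep).indices
        kept.append(remaining[members[top]])
    keep_idx = torch.cat(kept).unique()
    return visual_tokens[keep_idx], keep_idx
```

Because the method is training-free, a sketch like this would sit between the vision encoder and the language model, shrinking the visual token sequence before it is concatenated with the instruction tokens; the reported 94.4% removal rate corresponds roughly to the `keep_ratio` used here.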

