EMNLP 2025

November 07, 2025

Suzhou, China


Vision-language models (VLMs) are highly effective at semantic reasoning but struggle with a basic perceptual skill: recognizing hidden content in optical illusions and camouflaged images, which humans can perceive through simple adjustments like squinting or zooming. We introduce HC-Bench, a benchmark of over 1,200 images containing hidden text, objects, and illusions. Our evaluation across 11 state-of-the-art VLMs shows near-zero accuracy even when explicit prompts are provided, in stark contrast to human performance. Surprisingly, we find that downscaling the input image to a low resolution (32–128 pixels) restores model accuracy to over 99%. Additional experiments, including fine-tuning and image blurring, support the hypothesis that high-resolution inputs introduce redundant local features that interfere with global pattern recognition. These findings reveal a critical architectural blind spot in current VLMs and point toward the need for hybrid models with multi-scale visual processing. Our results have implications for applications in medical imaging, security, and other real-world settings that require robust visual understanding.
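The key intervention the abstract describes, downscaling an image so that redundant local features are averaged away and only the global pattern survives, can be sketched as below. This is a minimal, dependency-free illustration, not the paper's actual pipeline: the 32–128 pixel target range comes from the abstract, while the box-filter implementation and grayscale representation are illustrative assumptions.

```python
def downscale(image, target):
    """Box-filter a grayscale image (a list of equal-length rows of
    0-255 ints) so that its longer side is `target` pixels.

    Averaging each block of pixels acts as a low-pass filter,
    discarding high-frequency local detail while preserving the
    coarse global structure that hidden-content images rely on,
    much like squinting at the image.
    """
    h, w = len(image), len(image[0])
    out_h = max(1, round(h * target / max(h, w)))
    out_w = max(1, round(w * target / max(h, w)))
    result = []
    for oy in range(out_h):
        # Source-row span covered by output row `oy`.
        y0 = oy * h // out_h
        y1 = max((oy + 1) * h // out_h, y0 + 1)
        row = []
        for ox in range(out_w):
            # Source-column span covered by output column `ox`.
            x0 = ox * w // out_w
            x1 = max((ox + 1) * w // out_w, x0 + 1)
            block = [image[y][x] for y in range(y0, y1)
                                 for x in range(x0, x1)]
            row.append(sum(block) // len(block))  # mean of the block
        result.append(row)
    return result
```

Bilinear or area resampling in an image library would serve the same purpose; the point is only that the transformation is lossy by design, which is why it plausibly removes the distracting fine detail the abstract attributes to high-resolution inputs.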


