EMNLP 2025

November 07, 2025

Suzhou, China


Large Language Models (LLMs) and their multimodal variants can now accept visual inputs, including images of text, raising an intriguing possibility: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for LLMs. Using visual input in place of long text, multimodal models can achieve comparable reasoning performance at significantly reduced input token cost. We demonstrate this on a challenging reasoning benchmark (BABILong 1k), where a state-of-the-art vision-language model (Gemini-2.5) attains higher accuracy than GPT-4.1 with up to 50% fewer tokens. We analyze when this approach succeeds and discuss the trade-offs of image vs. text inputs, highlighting a new avenue for improving LLM scalability and cost-efficiency in real-world deployments.
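The core recipe is straightforward: rasterize the long textual context into an image and send that image, together with a short textual question, to a vision-language model. Below is a minimal sketch of this idea, not the authors' exact pipeline; the rendering parameters (font, wrap width, margins) and the client call in the usage comment (the google-generativeai package with a "gemini-2.5-pro" model name) are assumptions to adapt.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, wrap_width: int = 100,
                         font_size: int = 16, margin: int = 12) -> Image.Image:
    """Rasterize `text` onto a white RGB image, wrapping lines at `wrap_width` chars."""
    lines = []
    for para in text.splitlines():
        lines.extend(textwrap.wrap(para, width=wrap_width) or [""])
    # A TrueType font via ImageFont.truetype() gives more legible output;
    # load_default() is a portable fallback (size= requires Pillow >= 10.1).
    font = ImageFont.load_default(size=font_size)
    line_height = font_size + 4
    width = wrap_width * font_size // 2 + 2 * margin   # rough monospace width estimate
    height = len(lines) * line_height + 2 * margin
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img

# Usage with Google's google-generativeai client (model name is an assumption;
# requires `pip install google-generativeai pillow` and an API key):
# import google.generativeai as genai
# genai.configure(api_key="YOUR_KEY")
# model = genai.GenerativeModel("gemini-2.5-pro")
# context_img = render_text_as_image(long_document_text)  # long_document_text: your context
# answer = model.generate_content([context_img, "Using the context above, answer: ..."])
# print(answer.text)
```

The token savings come from the vision encoder mapping a fixed image region to a fixed number of visual tokens, so a densely rendered page can carry far more characters per token than the model's text tokenizer would allow.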

Downloads

  • Slides
  • Paper
  • Transcript English (automatic)

