Large Language Models (LLMs) and their multimodal variants can now accept visual inputs, including images of text, raising an intriguing possibility: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for LLMs. Using visual input in place of long text, multimodal models can achieve comparable reasoning performance at significantly reduced input token cost. We demonstrate this on a challenging reasoning benchmark (BABILong 1k), where a state-of-the-art vision-language model (Gemini-2.5) attains higher accuracy than GPT-4.1 with up to 50% fewer tokens. We analyze when this approach succeeds and discuss the trade-offs of image vs. text inputs, highlighting a new avenue for improving LLM scalability and cost-efficiency in real-world deployments.
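To make the idea concrete, the sketch below renders a long text context to a single image with Pillow so it can be passed to a multimodal model as visual input rather than text tokens. This is a minimal illustration, not the paper's actual pipeline: the rendering parameters are arbitrary, and `send_to_vlm` is a hypothetical placeholder for whichever vision-language API (e.g. a Gemini-2.5 endpoint) is used.

```python
# Minimal sketch: turn a long text context into a PNG "page" that a
# multimodal model can read as an image instead of as text tokens.
# Assumptions: Pillow is installed; send_to_vlm() is a hypothetical
# wrapper around a vision-language API and is NOT a real library call.
import textwrap
from PIL import Image, ImageDraw, ImageFont


def render_text_as_image(text: str, width_chars: int = 100) -> Image.Image:
    """Wrap `text` at `width_chars` columns and draw it on a white canvas."""
    font = ImageFont.load_default()          # a TTF font would give denser, crisper text
    lines = textwrap.wrap(text, width=width_chars) or [""]
    line_height = 16                          # rough height for the default bitmap font
    img = Image.new(
        "RGB",
        (width_chars * 7 + 20, line_height * len(lines) + 20),
        "white",
    )
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img


if __name__ == "__main__":
    long_context = "The quick brown fox jumps over the lazy dog. " * 200  # stand-in document
    page = render_text_as_image(long_context)
    page.save("context_page.png")
    # Hypothetical call to a multimodal model with the rendered page plus a short text prompt:
    # answer = send_to_vlm(image="context_page.png",
    #                      prompt="Answer the question using the page above.")
```

In this setup, only the short question is billed as text tokens; the long context is consumed through the image pathway, which is where the reported token savings would come from.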