Interpreting visual scenes with high-level reasoning is essential for many real-world applications—from autonomous systems to content moderation—but training and maintaining Vision-Language Models (VLMs) remains resource-intensive and opaque. In this work, we present CAPSTONE, a lightweight and modular framework designed for industrial settings. Instead of relying on multimodal training or fine-tuning large models, CAPSTONE transforms outputs from off-the-shelf vision models into structured text prompts that can be interpreted by a frozen Large Language Model (LLM). This plug-and-play architecture enables reasoning over visual input without access to raw pixels, dramatically reducing computational cost and complexity. On the POPE dataset, our system—using a 7B LLM—outperforms several fully trained VLMs in zero-shot evaluations, demonstrating strong generalization without retraining. CAPSTONE offers a scalable and interpretable alternative for companies looking to integrate visual reasoning capabilities without the burden of full-scale VLM pipelines.
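To make the described plug-and-play architecture concrete, the sketch below shows one way such a pipeline could be wired together: a frozen off-the-shelf detector produces object detections, which are serialized into a structured text prompt for a frozen 7B-class LLM. The specific model checkpoints, the prompt template, and the helper functions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CAPSTONE-style pipeline (assumed components, not the
# authors' code): vision model outputs -> structured text prompt -> frozen LLM.
from transformers import pipeline

# Frozen, off-the-shelf components; nothing is fine-tuned.
# Checkpoint names are arbitrary stand-ins for "an off-the-shelf detector"
# and "a 7B LLM".
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")


def detections_to_prompt(detections, question):
    """Serialize vision-model outputs into a structured text prompt."""
    lines = [
        f"- {d['label']} (confidence {d['score']:.2f}, box {tuple(d['box'].values())})"
        for d in detections
    ]
    scene = "\n".join(lines) if lines else "- (no objects detected)"
    return (
        "You are given an image described only as a list of detected objects.\n"
        f"Detected objects:\n{scene}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(image_path, question):
    # 1) The vision model sees the pixels; the LLM never does.
    detections = detector(image_path)
    # 2) A structured text prompt bridges the two frozen models.
    prompt = detections_to_prompt(detections, question)
    # 3) The frozen LLM reasons over the textual scene description.
    out = llm(prompt, max_new_tokens=32, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()


if __name__ == "__main__":
    # Example of a POPE-style object-presence query.
    print(answer("street.jpg", "Is there a dog in the image?"))
```

Because both components stay frozen, swapping the detector or the LLM only requires changing the checkpoint names, which is the plug-and-play property the abstract emphasizes.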