EMNLP 2025

November 05, 2025

Suzhou, China


Visual language models (VLMs) have demonstrated remarkable capabilities across tasks such as visual question answering and image captioning. However, most models rely on text-based instructions, which limits their effectiveness in natural human-machine interaction. Moreover, the quality of language models depends heavily on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when instructions are given as speech. To address these challenges, we propose SilVar, an end-to-end multimodal model that uses speech instructions for reasoning-based visual question answering. We also investigate reasoning at different levels of instruction complexity, including conversational, simple, and complex speech instructions. SilVar is built on CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interaction by allowing users to give verbal or text-based instructions. To this end, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. The dataset strengthens the model's ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interaction. To the best of our knowledge, SilVar is the first open-source, speech-driven VLM. Despite the challenges posed by speech-based instructions, experiments show that our speech-driven multimodal model performs on par with text-based models of similar size on the MMMU and ScienceQA benchmarks, demonstrating its potential in scenarios where text input is unavailable or impractical, such as in self-driving cars or during surgery. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.
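
The abstract names the three components (CLIP, Whisper, LLaMA 3.1-8B) but does not specify how they are fused. A common pattern for this kind of speech-and-vision-conditioned LLM, sketched below purely as an assumption rather than the authors' actual design, is to project each encoder's output into the LLM's token-embedding space with learned linear layers and feed the projected features to the LLM as a soft prompt. The Hugging Face checkpoint names, the `SpeechVisionLLM` class, and the concatenation-based fusion are all illustrative choices.

```python
# A minimal sketch of a SilVar-style architecture, assuming linear-projector
# fusion of CLIP and Whisper features into the LLM embedding space. This is
# NOT the authors' implementation; checkpoints and fusion are illustrative.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, WhisperModel, AutoModelForCausalLM

class SpeechVisionLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        self.speech = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
        d = self.llm.config.hidden_size  # 4096 for LLaMA 3.1-8B
        # Learned projectors map each modality's features into the LLM space.
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, d)
        self.speech_proj = nn.Linear(self.speech.config.d_model, d)

    def forward(self, pixel_values, input_features):
        # pixel_values: image batch from a CLIP image processor.
        # input_features: log-Mel spectrograms from a Whisper feature extractor.
        v = self.vision_proj(self.vision(pixel_values=pixel_values).last_hidden_state)
        s = self.speech_proj(self.speech(input_features=input_features).last_hidden_state)
        # Concatenate projected visual and speech tokens as a soft prompt and
        # let the LLM produce the answer autoregressively.
        prompt = torch.cat([v, s], dim=1)
        return self.llm(inputs_embeds=prompt)
```

In practice, a setup like this would typically freeze the pretrained encoders and train only the projectors (and optionally the LLM) on instruction data, but the abstract does not say whether SilVar follows that recipe.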


