Visual Language Models (VLMs) have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in natural human-machine interactions. Moreover, the reasoning quality of language models depends heavily on prompting techniques such as chain-of-thought, which remain underexplored when instructions are given as speech. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. Additionally, we investigate reasoning techniques at different levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interactions by allowing users to provide verbal or text-based instructions. To support this, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model’s ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interactions. To the best of our knowledge, SilVar is the first open-source, speech-driven VLM. Despite the challenges posed by speech-based instructions, experiments show that our speech-driven multimodal model performs on par with text-based models of similar size on the MMMU and ScienceQA benchmarks, demonstrating its potential in scenarios where text input is unavailable or not preferred, such as self-driving cars and medical surgery. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.
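To make the architecture concrete, the sketch below shows one way the components named in the abstract (CLIP, Whisper, LLaMA 3.1-8B) could be wired into a speech-driven VLM. It assumes a LLaVA-style design in which frozen vision and speech encoders are linearly projected into the LLM embedding space and prepended to the text tokens; the checkpoint names, projector shapes, and fusion strategy are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a speech-driven VLM in the spirit of SilVar.
# Assumptions: LLaVA-style linear projectors and simple concatenation of
# image, speech, and text tokens; the real model may differ.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel, WhisperModel


class SpeechDrivenVLM(nn.Module):
    def __init__(self,
                 llm_id="meta-llama/Llama-3.1-8B",          # assumed checkpoint
                 vision_id="openai/clip-vit-large-patch14",  # assumed checkpoint
                 speech_id="openai/whisper-small"):           # assumed checkpoint
        super().__init__()
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id)
        self.vision = CLIPVisionModel.from_pretrained(vision_id)
        self.speech = WhisperModel.from_pretrained(speech_id).encoder
        d_llm = self.llm.config.hidden_size
        # Linear projectors map each modality into the LLM token space.
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, d_llm)
        self.speech_proj = nn.Linear(self.speech.config.d_model, d_llm)

    def forward(self, pixel_values, speech_features, input_ids):
        # Encode the image and the spoken instruction, then project both
        # feature sequences into LLM embeddings.
        img_tokens = self.vision_proj(
            self.vision(pixel_values=pixel_values).last_hidden_state)
        sp_tokens = self.speech_proj(
            self.speech(speech_features).last_hidden_state)
        txt_tokens = self.llm.get_input_embeddings()(input_ids)
        # Prepend the multimodal tokens to the (optional) text prompt and
        # let the LLM produce the reasoning-based answer.
        inputs_embeds = torch.cat([img_tokens, sp_tokens, txt_tokens], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In this sketch the spoken instruction replaces (or supplements) the text prompt, which is what allows the same model to accept either verbal or text-based instructions, as described in the abstract.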