Recent spoken dialogue systems employ large language models (LLMs) with advanced reasoning capabilities as their core architecture. However, text optimized for reading differs from delivery optimized for listening, which makes it difficult to leverage the reasoning process effectively in spoken communication. Although some efforts adapt language models toward more speech-suitable delivery, the impact of these modifications on the models' reasoning capabilities remains underexplored. In this work, we propose the Think-Verbalize-Speak framework, which separates the reasoning process from the spoken content to fully harness the reasoning capabilities of LLMs in spoken dialogue. Specifically, we introduce an intermediate step between thinking and speaking, termed "verbalizing", in which the thought process is translated into comprehensible text. We also present ReVerT, a latency-efficient implementation of the verbalizer based on incremental and asynchronous summarization. Extensive automatic and human evaluations across multiple benchmarks demonstrate that our approach improves speech naturalness and conciseness with minimal compromise to reasoning ability. We release both the dataset and its construction pipeline to facilitate future research.
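The abstract's think → verbalize → speak separation can be pictured as a minimal pipeline sketch. This is purely illustrative: the function names, the stubbed reasoning, and the naive "keep the conclusion" verbalizer are assumptions for exposition, not the authors' API or the actual ReVerT summarizer (which works incrementally and asynchronously over the reasoning stream).

```python
# Hypothetical sketch of the Think-Verbalize-Speak separation.
# All names and bodies are illustrative stand-ins, not the paper's code.

def think(question: str) -> str:
    """Stage 1: produce a detailed, reader-oriented chain of thought (stubbed)."""
    return f"Step 1: parse '{question}'. Step 2: compute. Step 3: conclude 42"

def verbalize(thought: str) -> str:
    """Stage 2: translate the thought into concise, listener-friendly text.

    ReVerT would summarize incrementally and asynchronously as thoughts
    stream in; this stand-in simply keeps the final conclusion.
    """
    return thought.split(". ")[-1]

def speak(utterance: str) -> str:
    """Stage 3: hand the verbalized text to a TTS engine (stubbed as identity)."""
    return utterance

# The spoken output carries the answer without the verbose reasoning trace.
answer = speak(verbalize(think("what is 6 x 7?")))
print(answer)
```

The point of the sketch is the interface boundary: the verbose reasoning never reaches the speech stage directly, only its verbalized summary does, which is what lets the system keep full reasoning while speaking concisely.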