Although speech models are expected to align well with brain language processing when subjects are listening, recent studies find that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based models align better, since they do capture brain-relevant semantics. No previous study has investigated the alignment effectiveness of text and speech representations from multimodal models. Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality can leverage this synergistic multimodal understanding for improved alignment with brain language processing? Can text or speech representations from multimodal models outperform those from unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations and assessing their alignment with MEG brain recordings during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from the same models. In particular, text embeddings from multimodal models exhibit a peak in alignment around 200 ms, a period of heightened brain activity, suggesting that they benefit from speech information. In contrast, speech embeddings from multimodal models show alignment similar to that of their unimodal counterparts, suggesting that they do not gain meaningful semantic benefits from text-based representations. These results highlight an asymmetry in cross-modal knowledge transfer: the text modality benefits from speech information, but not vice versa.
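For readers who want a concrete picture of the alignment analysis, the sketch below implements a cross-validated ridge-regression encoding model, a standard approach for measuring brain-model alignment; the abstract does not specify the exact estimator, MEG dataset, or preprocessing, so the data shapes, sensor count, time window, and simulated inputs here are illustrative assumptions. In practice, `X` would hold per-word hidden states extracted from a multimodal model's text or speech branch, and `Y` would hold word-locked MEG epochs.

```python
# Minimal sketch of a cross-validated ridge encoding analysis (assumptions:
# shapes, sensor/time grid, and random stand-in data are illustrative only).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

n_words, emb_dim = 500, 768    # words in the story, embedding dimensionality
n_sensors, n_times = 208, 41   # MEG sensors; -200..600 ms in 20 ms steps

# Stand-ins: replace with real model embeddings and epoched MEG recordings.
X = rng.standard_normal((n_words, emb_dim))             # one embedding per word
Y = rng.standard_normal((n_words, n_sensors, n_times))  # word-locked MEG epochs


def encoding_alignment(X, Y, n_splits=5, alphas=np.logspace(0, 6, 7)):
    """Predict each MEG time point from embeddings with ridge regression;
    return the mean cross-validated Pearson r (over sensors) per time point."""
    scores = np.zeros(Y.shape[2])
    kf = KFold(n_splits=n_splits, shuffle=False)  # no shuffle: keep story order
    for t in range(Y.shape[2]):
        fold_rs = []
        for train, test in kf.split(X):
            model = RidgeCV(alphas=alphas).fit(X[train], Y[train, :, t])
            pred, true = model.predict(X[test]), Y[test, :, t]
            # Pearson r per sensor, then averaged across sensors.
            pc, tc = pred - pred.mean(0), true - true.mean(0)
            r = (pc * tc).sum(0) / (
                np.linalg.norm(pc, axis=0) * np.linalg.norm(tc, axis=0) + 1e-12
            )
            fold_rs.append(r.mean())
        scores[t] = np.mean(fold_rs)
    return scores


alignment = encoding_alignment(X, Y)
peak_ms = -200 + 20 * int(np.argmax(alignment))
print(f"peak alignment at ~{peak_ms} ms (r = {alignment.max():.3f})")
```

Under this setup, the reported effect would appear as a peak in the alignment curve near +200 ms for multimodal text embeddings; keeping the folds unshuffled respects the temporal structure of the naturalistic story and avoids leakage between adjacent words.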