Marmoset monkeys exhibit complex vocal communication, challenging the view that primate vocalization is entirely innate, and their calls show features of human speech, such as individual naming and turn-taking. Studying their communication offers a unique opportunity to link language with neural activity, especially given the difficulty of accessing the human brain in speech and language research. Since marmosets communicate solely through vocalizations, with no textual form, applying standard text-based LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for marmoset vocalizations. We design novel zero-shot evaluation metrics using unsupervised in-the-wild data alongside weakly labeled conversational data to assess GmSLM, demonstrating its advantage over a basic human-speech-based baseline. Generated vocalizations closely match real resynthesized samples acoustically and perform well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations. Importantly, this tool supports investigation of the neural basis of vocal communication and provides a practical framework linking vocal behavior and brain activity. We believe GmSLM can benefit future work in neuroscience, bioacoustics, and evolutionary biology. Audio samples: https://anonymous.4open.science/w/anon_demo-6162/