
VIDEO DOI: https://doi.org/10.48448/5e2t-e830
poster
Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground
keywords:
discourse and pragmatics
language model evaluation
theory of mind
Evaluating the theory of mind (ToM) capabilities of language models (LMs) has recently received a great deal of attention. However, many existing benchmarks rely on synthetic data, which risks misaligning the resulting experiments with human behavior. We introduce the first ToM dataset based on naturally occurring spoken dialogs, Common-ToM, and show that LMs struggle to demonstrate ToM. We then show that integrating a simple, explicit representation of beliefs improves LM performance on Common-ToM.