keywords:
dataset generation
theory of mind
large language models
story generation
evaluation
Evaluating Theory of Mind (ToM) in Large Language Models (LLMs) is an important area of research for understanding the social intelligence of AI. Recent ToM benchmarks have made significant strides in enhancing the complexity, comprehensiveness, and practicality of evaluation. However, while the focus has been on constructing "more difficult" or "more comprehensive" tasks, there has been little systematic analysis of the structural factors that inherently determine the difficulty of ToM reasoning—that is, of "what" makes reasoning difficult. To address this gap, we propose a new dataset generation framework for ToM evaluation named AnaToM. To realize an "Anatomy of Difficulty" in ToM reasoning, AnaToM strictly controls structural parameters of a story, such as the number of entities and the length of the timeline. This parameter control makes it possible to isolate and identify the factors affecting the ToM of LLMs, allowing a more precise examination of their reasoning mechanisms. The proposed framework provides a systematic methodology for diagnosing the limits of LLM reasoning abilities and offers new guidelines for future benchmark design.
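The abstract does not specify how AnaToM's parameter control is implemented; as a minimal illustrative sketch, parameter-controlled generation of a false-belief story might look like the following. The `StoryConfig` fields, agent names, and the unexpected-transfer template are all hypothetical assumptions, not the authors' actual design.

```python
from dataclasses import dataclass
import random


@dataclass
class StoryConfig:
    """Hypothetical structural parameters controlling story difficulty."""
    num_entities: int      # number of agents appearing in the story
    timeline_length: int   # number of sequential events


def generate_story(config: StoryConfig, seed: int = 0) -> dict:
    """Generate a toy unexpected-transfer story whose structure follows config."""
    rng = random.Random(seed)
    agents = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"][: config.num_entities]
    locations = ["basket", "box", "drawer", "shelf"]
    item = "ball"

    events = []
    true_location = rng.choice(locations)
    events.append(f"{agents[0]} puts the {item} in the {true_location}.")
    # Everyone witnesses the initial placement.
    beliefs = {a: true_location for a in agents}

    for _ in range(config.timeline_length - 1):
        mover = rng.choice(agents)
        absent = [a for a in agents if a != mover]
        new_location = rng.choice([l for l in locations if l != true_location])
        events.append(
            f"While {', '.join(absent)} are away, {mover} moves the {item} "
            f"to the {new_location}."
        )
        true_location = new_location
        beliefs[mover] = new_location  # only the mover updates their belief

    return {
        "events": events,
        "story": " ".join(events),
        "question": f"Where does {agents[-1]} think the {item} is?",
        "answer": beliefs[agents[-1]],
        "true_location": true_location,
    }
```

Because the entity count and timeline length are explicit arguments, one can sweep each parameter independently and attribute changes in model accuracy to a single structural factor, which is the kind of isolation the abstract describes.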