

Humans are remarkably adept at inferring the causes of events in our environment; doing so often requires incorporating information from multiple sensory modalities. For instance, if a car slows down in front of us, we rapidly revise our inferences about why it did so if we also hear sirens in the distance. Here, we investigate the ability to reconstruct others' past actions and events by integrating multimodal information. Participants were asked to infer which of two agents performed an action in a household setting given either visual evidence, auditory evidence, or both. We evaluate humans, a large language model (GPT-4), and a large multimodal model (GPT-4V) on this task. We find that humans are relatively accurate overall and perform best when given multimodal evidence, and that they appear to place more emphasis on visual evidence than on auditory evidence. GPT-4's overall accuracy closely matches that of humans in all modalities, but is only weakly correlated with human accuracy across trials, suggesting different reasoning mechanisms. Meanwhile, GPT-4V has lower accuracy and exhibits no evidence of incorporating multimodal information. People's ability to reconstruct the behavior of others relies on successfully integrating evidence across different senses. Such multimodal reasoning presents an intriguing challenge for multimodal AI systems.
Authors:
Sarah A Wu: Stanford University; Erik Brockbank: Stanford University; Hannah Cha: Stanford University; Jan-Philipp Fränken: University of Edinburgh; Emily Jin: Stanford University; Zhuoyi Huang: Stanford University; Weiyu Liu: Stanford University; Ruohan Zhang: Stanford University; Jiajun Wu: Stanford University; Tobias Gerstenberg: Stanford University
