CogSci 2025

August 01, 2025

San Francisco, United States


Keywords: behavioral science, comparative studies, language understanding, linguistics

Large Language Models (LLMs) display an impressive set of capabilities in linguistic understanding. While advanced models outperform humans on certain tasks, LLM reasoning and linguistic competence differ from those of humans (Felin & Holweg, 2024; Mahowald et al., 2024; Niu et al., 2024). In this study, we evaluate humans and GPT-4o on the Winograd Schema Challenge, a pronoun resolution task. We focus on Japanese, a relatively understudied language in the emergent field of human-LLM evaluation. To compare human and LLM performance, we manipulate both task demands and task content. We report three findings: (i) humans outperform LLMs in the baseline condition, i.e., the standard pronoun resolution task; (ii) as task demands increase, both human and LLM performance on the task declines (cf. Hu & Frank, 2024); and (iii) we find evidence for content effects (cf. Lampinen et al., 2024): LLMs surpass humans when the content of the task is manipulated to favor LLMs.
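As a concrete illustration of the task, below is a minimal sketch of how a single Winograd schema item might be posed to GPT-4o through the OpenAI Python SDK. The prompt wording, decoding settings, and string-match scoring are illustrative assumptions rather than the authors' protocol, and the example uses a classic English item (Levesque et al., 2012) instead of the study's Japanese materials.

    # Minimal sketch: posing one Winograd schema item to GPT-4o via the
    # OpenAI Python SDK. Prompt wording and scoring are illustrative only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Classic English Winograd schema item (Levesque et al., 2012);
    # the study itself uses Japanese materials.
    schema = (
        "The trophy doesn't fit in the brown suitcase because it is too big. "
        "What is too big? Answer with 'the trophy' or 'the suitcase'."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": schema}],
        temperature=0,  # near-deterministic decoding for evaluation
    )

    answer = response.choices[0].message.content.strip().lower()
    correct = "trophy" in answer  # crude string match as a stand-in for scoring
    print(answer, correct)

The pronoun ("it") can only be resolved through world knowledge (trophies go inside suitcases), which is what makes the schema a test of understanding rather than surface pattern matching.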

Downloads

Paper

