AAAI 2026

January 23, 2026

Singapore, Singapore


Large language models (LLMs) often perform better when prompted to explain their reasoning, but it remains unclear how well such gains persist as reasoning depth increases. In this work, we propose a depth-aware evaluation framework and report results on two structured datasets: CLUTRR (kinship reasoning) and ProofWriter (logical entailment), comparing direct and reasoning prompts (where reasoning depth is the number of inference steps required) across five models. Reasoning prompts gave small gains at shallow depths but quickly weakened and often reversed as tasks grew more complex. In ProofWriter, GPT-5 reached 90% accuracy at depth four in the direct-prompting setting, yet its reasoning accuracy fell below the direct baseline beyond depth two. Smaller open-source models showed only unstable or negligible gains, underscoring that reasoning in LLMs remains brittle as depth increases.

Downloads

  • Slides
  • Paper

