IJCNLP-AACL 2025

December 22, 2025

Mumbai, India


keywords: proof planning, logical deduction, NLP datasets, benchmarking, reasoning, evaluation

Large language models (LLMs) have shown significant progress on reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs): LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming that LRMs are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that LRM performance drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find that the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
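
The paper's actual generative process is not reproduced on this page, but the core idea, scaling instance complexity in a controlled way, can be illustrated with a minimal sketch. The Python snippet below is a hypothetical example, not DeepRD's generator: it builds a graph connectivity query whose difficulty is set by the length of the witness path, surrounded by dead-end distractor edges that cannot shorten it. The function name make_connectivity_instance and its parameters are illustrative assumptions.

import random

def make_connectivity_instance(path_length: int, n_distractors: int, seed: int = 0):
    """Build a directed-graph connectivity query whose positive answer requires
    following a chain of `path_length` edges.

    Hypothetical sketch only: illustrates how complexity can be scaled by the
    witness-path length, in the spirit of the dataset described above.
    """
    rng = random.Random(seed)
    # The witness path: v0 -> v1 -> ... -> v_{path_length}.
    chain = [f"v{i}" for i in range(path_length + 1)]
    edges = [(chain[i], chain[i + 1]) for i in range(path_length)]
    # Distractor edges branch off the chain into dead ends, so the shortest
    # source-to-target path keeps length `path_length`.
    for j in range(n_distractors):
        u = rng.choice(chain[:-1])
        edges.append((u, f"d{j}"))
    rng.shuffle(edges)  # hide the chain order in the prompt
    prompt = ("Edges: " + ", ".join(f"{a}->{b}" for a, b in edges)
              + f". Is there a path from {chain[0]} to {chain[-1]}?")
    return prompt, "yes"

prompt, answer = make_connectivity_instance(path_length=8, n_distractors=12)
print(prompt)

Increasing path_length directly increases the number of deduction steps a model must chain together, which is the axis along which the abstract reports abrupt performance drops.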

