AAAI 2026

January 22, 2026

Singapore, Singapore


The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model’s capability boundary. We validate that our framework’s predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent’s limits and a practical foundation for building more efficient systems.
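To make the Intrinsic Load idea concrete: a Tool Interaction Graph can be read as a DAG whose nodes are tool calls and whose edges mean one call's output feeds another's input. The abstract does not specify the paper's exact metric, so the tool names and the load proxy below (chain depth plus tool count) are purely illustrative assumptions, a minimal sketch rather than the authors' formalization.

```python
# Hypothetical sketch: tool names, edges, and the intrinsic-load proxy
# are illustrative assumptions, not the paper's actual formalization.
from collections import defaultdict

def longest_chain(edges, nodes):
    """Length (in edges) of the longest dependency chain in a DAG."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    memo = {}
    def depth(n):
        # Number of nodes on the longest path starting at n.
        if n not in memo:
            memo[n] = 1 + max((depth(m) for m in adj[n]), default=0)
        return memo[n]
    return max(depth(n) for n in nodes) - 1

# Toy Tool Interaction Graph: an edge (A, B) means B consumes A's output.
tools = ["search", "fetch_page", "extract_table", "plot"]
deps = [("search", "fetch_page"),
        ("fetch_page", "extract_table"),
        ("extract_table", "plot")]

# One plausible structural-complexity proxy: chain depth plus tool count.
intrinsic_load = longest_chain(deps, tools) + len(tools)
print(intrinsic_load)  # 3 + 4 = 7
```

Varying the number of tools and the depth of their dependency chains is one natural way a benchmark could parametrically dial intrinsic load up or down.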
