As large language models (LLMs) are increasingly deployed in high-stakes domains such as education, healthcare, and law, accurately evaluating their nuanced reasoning processes becomes essential to ensure their safety, reliability, and trustworthiness. However, most existing benchmarks evaluate LLMs at a coarse granularity: they emphasize final results and neglect the intermediate reasoning steps, which masks latent deficits, produces misleadingly high scores, and ultimately limits accurate assessment of model suitability for complex real-world scenarios. To address these limitations, we introduce \textit{CogProbe}, a diagnostic benchmark that decomposes complex reasoning processes into orthogonal cognitive operations, featuring the multilingual \textit{CogEval} datasets and cognitively informed metrics for fine-grained evaluation of LLM cognitive capabilities. Drawing on cognitive psychology, we design a comprehensive taxonomy of model capabilities comprising 5 macro-cognitive capabilities and 17 corresponding micro-cognitive operations, which facilitates precise identification of latent weaknesses and provides detailed assessments of model capabilities, supporting informed deployment of LLMs in real-world scenarios. Experimental results demonstrate that our method can effectively assess implicit cognitive capabilities. They further reveal that, despite achieving high scores on traditional benchmarks, current LLMs exhibit significant cognitive deficits, particularly in metacognitive capability, and that merely training models on coarse-grained datasets does not effectively enhance their underlying cognitive capabilities.
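To make the taxonomy-based evaluation concrete, the sketch below shows one plausible roll-up scheme: per-item scores for micro-cognitive operations are averaged, then grouped into macro-capability scores. All operation and capability names here are hypothetical placeholders, and the mean-based aggregation is an assumption for illustration; the actual CogProbe taxonomy and metrics are defined in the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical taxonomy fragment: micro-operations grouped under
# macro-cognitive capabilities (names illustrative, not from the paper).
TAXONOMY = {
    "memory":        ["recall", "recognition"],
    "comprehension": ["paraphrase", "inference"],
    "metacognition": ["self_monitoring", "error_detection"],
}

def aggregate_scores(per_item_scores):
    """Average per-item micro-operation scores, then roll them up
    into one score per macro-cognitive capability."""
    micro = defaultdict(list)
    for item in per_item_scores:          # item: {operation: score in [0, 1]}
        for op, score in item.items():
            micro[op].append(score)
    micro_avg = {op: mean(s) for op, s in micro.items()}

    macro = {}
    for capability, ops in TAXONOMY.items():
        covered = [micro_avg[op] for op in ops if op in micro_avg]
        macro[capability] = mean(covered) if covered else None
    return micro_avg, macro

# Example: two evaluated items with judged operation scores.
scores = [
    {"recall": 0.9, "inference": 0.6, "self_monitoring": 0.2},
    {"recall": 0.8, "inference": 0.5, "error_detection": 0.3},
]
micro_avg, macro = aggregate_scores(scores)
print(macro)  # a low metacognition score surfaces the latent deficit
```

Reporting scores at both levels is what allows a model with strong aggregate accuracy to still be flagged for a weak macro capability, such as the metacognitive deficits described above.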
