IJCNLP-AACL 2025

December 21, 2025

Mumbai, India


keywords:

evidence triplets

spurious correctness

calibration

confidence estimation

large language models

multi-hop question answering

In the deployment of Large Language Models (LLMs), spurious correctness, where an answer is correct but the reasoning behind it contains errors, poses a critical risk by creating an illusion of reliability. Prior work on LLM confidence estimation focuses on answer-level or whole-reasoning-path confidence; these coarse-grained approaches cannot identify which specific parts of the reasoning contain errors. We propose a fine-grained confidence estimation framework that computes confidence scores for individual evidence triplets within reasoning chains, enabling precise localization of errors. Using carefully designed prompts, we generate answers, evidence in triplet format, and their respective confidence scores simultaneously, allowing automatic detection of spurious-correctness patterns in which part of the evidence contains factual errors. Evaluated on both Japanese and English multi-hop QA benchmarks across multiple models from three model families representing different architectures and training approaches, our approach exhibits superior calibration for evidence confidence and effectively detects spuriously correct answers (up to 0.84 on our primary discrimination metric). Consistent improvements across languages demonstrate the generalizability of the method. As a secondary benefit, joint generation of confidence scores improves answer confidence calibration by up to 43%. This prompt-based approach requires no model retraining and is immediately applicable to existing LLMs.
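The detection step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the triplet structure, the 0.5 thresholds, and the flagging rule are all assumptions for demonstration. It flags a potential spuriously correct answer when the answer-level confidence is high but at least one supporting evidence triplet is low-confidence.

```python
from dataclasses import dataclass

@dataclass
class EvidenceTriplet:
    # (subject, relation, object) as extracted from the model's reasoning,
    # plus the model-reported confidence in [0, 1]
    subject: str
    relation: str
    obj: str
    confidence: float

def flag_spurious_correctness(answer_confidence: float,
                              triplets: list[EvidenceTriplet],
                              answer_threshold: float = 0.5,
                              evidence_threshold: float = 0.5) -> bool:
    """Flag a potential spurious-correct answer: the answer itself is
    confident, but the weakest supporting triplet is low-confidence.
    Thresholds here are illustrative, not values from the paper."""
    if not triplets:
        return False
    weakest = min(t.confidence for t in triplets)
    return answer_confidence >= answer_threshold and weakest < evidence_threshold

# Toy example: the final answer is confident, but one reasoning hop is shaky.
triplets = [
    EvidenceTriplet("Mumbai", "located_in", "India", 0.95),
    EvidenceTriplet("IJCNLP-AACL 2025", "held_in", "Mumbai", 0.30),  # weak evidence
]
print(flag_spurious_correctness(0.9, triplets))  # → True
```

In practice the triplets and their confidences would be parsed from the model's structured output rather than constructed by hand; the point is that a per-triplet score makes the weak hop directly localizable, which a single answer-level score cannot do.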


