EMNLP 2025

November 05, 2025

Suzhou, China


LLM-as-a-judge has become a promising paradigm for evaluating natural language generation (NLG), but its limited reliability constrains deployment in high-risk applications. While it is common to use LLMs to directly score LLM-generated content, uncertainty quantification for such rating evaluation remains underexplored. This work presents the first analysis framework to offer interval evaluations in LLM-based scoring via conformal prediction. Conformal prediction constructs continuous confidence intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also propose a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis across evaluators and conformal prediction methods show that our framework yields narrow intervals with reliable coverage, enabling more trustworthy evaluation for downstream decision making.
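The abstract outlines a split-style conformal recipe: calibrate a nonconformity quantile on scored examples, form a continuous interval around the judge's raw score, snap it to the discrete rating scale, and report the interval midpoint. Below is a minimal sketch of that pipeline, assuming absolute-residual nonconformity and an outward floor/ceil snap as the ordinal boundary adjustment; the function name `split_conformal_interval` and these specific choices are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def split_conformal_interval(cal_scores, cal_labels, test_score,
                             alpha=0.1, rating_min=1, rating_max=5):
    """Interval around an LLM judge's score via split conformal prediction.

    Nonconformity is the absolute residual between the judge's score and a
    reference (e.g., human) label on a held-out calibration set. This is an
    illustrative sketch; the paper's nonconformity measure and ordinal
    adjustment may differ.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=float)
    n = len(cal_scores)

    # Nonconformity scores on the calibration set.
    residuals = np.abs(cal_labels - cal_scores)

    # Finite-sample-corrected quantile for >= (1 - alpha) marginal coverage.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(residuals, level, method="higher")

    # Continuous interval from a single evaluation run.
    lo, hi = test_score - q, test_score + q

    # Ordinal boundary adjustment (assumed form): snap outward to the
    # discrete rating grid, then clamp to the valid rating range.
    lo = int(max(np.floor(lo), rating_min))
    hi = int(min(np.ceil(hi), rating_max))

    # Midpoint of the interval as a low-bias point estimate.
    midpoint = (lo + hi) / 2.0
    return lo, hi, midpoint
```

For a 1–5 rating task with a few hundred calibration pairs and alpha = 0.1, this might return an integer-bounded interval such as (3, 5) with 4.0 as the midpoint score; the finite-sample correction in `level` is what gives the standard split-conformal coverage guarantee.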

Downloads

Slides · Paper · Transcript (English, automatic)

