EMNLP 2025

November 05, 2025

Suzhou, China

LLM-as-a-judge evaluation metrics have gained popularity as an inexpensive and performant substitute for human evaluation. However, we find that the meta-evaluation setting in which the reliability of these LLM evaluators is established differs substantially from how they are used in model development. To address this, we propose a new meta-evaluation methodology that aligns more closely with practice by examining evaluators' ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that LLM evaluator correlation with human judgments falls from ~0.8 to ~0.3 when evaluated in realistic settings, exposing a key limitation of current norms. Equipped with this better methodology, we then analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. Our meta-evaluation strategy demonstrates that single-reference evaluators only rank test systems well within particular capability ranges, even when the standard meta-evaluation reports high overall correlation. Taken together, our analysis reveals critical issues with current LLM (meta-)evaluation, and we recommend avenues for improvement.
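To make the fine-grained idea concrete, here is a minimal sketch (not the authors' code; the function name, scores, and closeness threshold are illustrative assumptions) contrasting standard meta-evaluation over all system pairs with the proposed restriction to pairs of systems whose human-rated quality is close.

```python
from itertools import combinations

def pairwise_agreement(human_scores, judge_scores, max_gap=None):
    """Fraction of system pairs that the LLM judge ranks the same way humans do.

    human_scores / judge_scores: dicts mapping system name -> mean score.
    max_gap: if given, only count pairs whose human scores differ by at most
             this amount, i.e. systems that are close in capability.
    """
    agree, total = 0, 0
    for a, b in combinations(human_scores, 2):
        if max_gap is not None and abs(human_scores[a] - human_scores[b]) > max_gap:
            continue  # skip far-apart pairs, which any judge ranks easily
        total += 1
        human_order = human_scores[a] - human_scores[b]
        judge_order = judge_scores[a] - judge_scores[b]
        if human_order * judge_order > 0:  # same ordering (ties count as disagreement)
            agree += 1
    return agree / total if total else float("nan")

# Illustrative (fabricated) scores for four hypothetical test systems.
human = {"sys_A": 0.81, "sys_B": 0.78, "sys_C": 0.55, "sys_D": 0.30}
judge = {"sys_A": 0.74, "sys_B": 0.79, "sys_C": 0.60, "sys_D": 0.28}

print("all pairs:  ", pairwise_agreement(human, judge))        # coarse meta-evaluation
print("close pairs:", pairwise_agreement(human, judge, 0.05))  # fine-grained: near-ties only
```

Restricting to near-ties is what drives the reported drop: far-apart pairs are easy for almost any judge, so an all-pairs correlation can look strong even when the evaluator is unreliable exactly where model developers need it most.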

Downloads

  • Slides
  • Paper
