EMNLP 2025

November 05, 2025

Suzhou, China


As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against reference outputs from an encore model. This encore-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-lite, which combines direct head-to-head comparison of outputs from competing systems with a tournament structure, eliminating the need for encore outputs, reducing the number of required comparisons, and achieving higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-lite, streamlining model selection across research and industry communities.
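The abstract does not spell out the tournament procedure, but the core efficiency claim can be illustrated with a minimal sketch: a single-elimination bracket over competing systems needs only n-1 judge calls to pick a winner, versus n(n-1)/2 for all-pairs comparison. The function names and the length-based toy judge below are hypothetical stand-ins, not Arena-lite's actual implementation.

```python
import random

def tournament_winner(systems, judge):
    """Single-elimination bracket over system names.

    `judge(a, b)` returns whichever of the two systems it prefers
    (in practice, an LLM judging the two systems' outputs head-to-head).
    Selecting a winner takes n - 1 comparisons for n systems,
    versus n * (n - 1) / 2 for exhaustive all-pairs comparison.
    """
    pool = list(systems)
    while len(pool) > 1:
        random.shuffle(pool)          # randomize bracket seeding
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:        # odd system out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Toy judge: prefer the system with the longer output
# (a deterministic stand-in for a real LLM judge).
outputs = {
    "sys_a": "short",
    "sys_b": "a much longer answer",
    "sys_c": "medium one",
}
winner = tournament_winner(
    outputs,
    lambda a, b: a if len(outputs[a]) >= len(outputs[b]) else b,
)
```

With this toy judge the bracket always crowns `sys_b`, since the longer output wins every match it plays; a full ranking (rather than a single winner) would repeat the bracket over the remaining systems or track elimination rounds.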
