As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against reference outputs from an anchor model. This anchor-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-lite, which combines direct head-to-head comparison of outputs from competing systems with a tournament structure, eliminating the need for anchor outputs, reducing the number of required comparisons, and achieving higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-lite, streamlining model selection across research and industry communities.
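The sketch below illustrates the general idea of tournament-based head-to-head ranking described in the abstract: for each prompt, candidate systems' outputs meet in a single-elimination bracket decided by a pairwise judge, and per-prompt winners are aggregated into an overall ranking. It is a minimal illustration, not the authors' released implementation; the `judge` callable (returning 0 if the first output wins, 1 otherwise) and the `generate` mapping from system name to a generation function are hypothetical placeholders.

```python
import random
from collections import Counter

def knockout_winner(systems, outputs, judge):
    """Run one single-elimination bracket over the systems' outputs for a prompt.

    Assumes `judge(output_a, output_b)` returns 0 if the first output wins
    and 1 if the second wins (a hypothetical pairwise LLM-judge wrapper).
    """
    bracket = list(systems)
    random.shuffle(bracket)  # random seeding for each prompt
    while len(bracket) > 1:
        next_round = []
        if len(bracket) % 2 == 1:
            next_round.append(bracket.pop())  # odd entrant gets a bye
        for a, b in zip(bracket[::2], bracket[1::2]):
            winner = a if judge(outputs[a], outputs[b]) == 0 else b
            next_round.append(winner)
        bracket = next_round
    return bracket[0]

def rank_systems(prompts, generate, judge):
    """Aggregate per-prompt tournament winners into an overall ranking."""
    wins = Counter()
    systems = list(generate.keys())
    for prompt in prompts:
        outputs = {s: generate[s](prompt) for s in systems}
        wins[knockout_winner(systems, outputs, judge)] += 1
    return wins.most_common()  # systems sorted by number of tournaments won
```

Under this scheme, each prompt costs roughly n-1 pairwise judgments for n systems, and no anchor-model output is ever needed, which is the source of the comparison savings the abstract mentions.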