LLM-as-a-judge evaluation metrics have gained popularity as an inexpensive and performant substitute for human evaluation. However, we find that the meta-evaluation setting in which the reliability of these LLM evaluators is established differs substantially from their use in model development. To address this, we propose a new meta-evaluation methodology that more closely aligns with practice by examining evaluators' ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that LLM evaluator correlations with human judgments fall from ~0.8 to ~0.3 when evaluated in realistic settings, exposing a key limitation of current norms. Equipped with this improved methodology, we then analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. Our meta-evaluation strategy demonstrates that single-reference evaluators only perform well at ranking test systems that fall within particular capability ranges, even when the standard meta-evaluation reports high overall correlation. Taken together, our analysis reveals critical issues with current LLM (meta-)evaluation, and we recommend avenues for improvement.
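For concreteness, the following is a minimal sketch of the fine-grained meta-evaluation idea described above: scoring an LLM judge by how well it orders pairs of test systems, optionally restricted to pairs whose human-rated capability gap is small. The function name, the pairwise-agreement measure, the gap threshold, and the example scores are illustrative assumptions, not the paper's exact protocol or data.

```python
# Illustrative sketch (not the authors' exact protocol): meta-evaluate an
# LLM judge by how well it ranks *pairs* of test systems, optionally
# restricted to pairs whose human-rated capability gap is small.
from itertools import combinations

def pairwise_agreement(human_scores, judge_scores, max_gap=None):
    """Fraction of system pairs ordered identically by humans and the judge.

    human_scores / judge_scores: dicts mapping system name -> mean score.
    max_gap: if set, only consider pairs whose human-score difference is
             at most this value (the "closer in capability" setting).
    """
    agree, total = 0, 0
    for a, b in combinations(human_scores, 2):
        gap = abs(human_scores[a] - human_scores[b])
        if max_gap is not None and gap > max_gap:
            continue  # skip pairs that are easy to tell apart
        total += 1
        human_order = human_scores[a] - human_scores[b]
        judge_order = judge_scores[a] - judge_scores[b]
        if human_order * judge_order > 0:  # same ordering of the pair
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical scores: agreement over all pairs vs. close-capability pairs only.
human = {"sys_a": 0.71, "sys_b": 0.69, "sys_c": 0.45, "sys_d": 0.44}
judge = {"sys_a": 0.80, "sys_b": 0.82, "sys_c": 0.55, "sys_d": 0.50}
print(pairwise_agreement(human, judge))                # all pairs
print(pairwise_agreement(human, judge, max_gap=0.05))  # close pairs only
```

In this toy example the judge agrees with humans on most pairs overall, but agreement drops sharply once only close-capability pairs are counted, mirroring the gap between standard and realistic meta-evaluation settings discussed in the abstract.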