EMNLP 2025

November 05, 2025

Suzhou, China


Large language models (LLMs) are increasingly used as evaluators in natural language generation tasks, offering advantages in scalability and interpretability over traditional evaluation methods. However, existing LLM-based evaluations often suffer from biases and misalignment, particularly in domain-specific tasks, due to limited functional understanding and knowledge gaps. To address these challenges, we first investigate the relationship between an LLM-based evaluator’s familiarity with the target task and its evaluation performance. We then introduce the Co-Eval framework, which leverages a criteria planner model and optimized machine metrics to enhance the scalability and fairness of LLM-based evaluation. Experimental results on both general and domain-specific tasks demonstrate that Co-Eval reduces biases, achieving up to a 0.4903 reduction in self-preference bias, and improves alignment with human preferences, with gains of up to 0.324 in Spearman correlation.
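For context, the alignment figure quoted above is a Spearman rank correlation between the LLM evaluator's scores and human preference ratings over the same outputs. The snippet below is a minimal illustrative sketch of how such a correlation can be computed with scipy's spearmanr; the scores are hypothetical and this is not the paper's released code.

# Illustrative sketch (not the Co-Eval implementation): measuring how well an
# LLM-based evaluator aligns with human preferences via Spearman correlation.
from scipy.stats import spearmanr

# Hypothetical quality ratings for the same set of generated outputs.
human_scores = [4, 2, 5, 3, 1, 4, 5, 2]                       # human preference ratings
llm_judge_scores = [3.5, 2.0, 4.8, 3.2, 1.5, 4.1, 4.6, 2.4]   # LLM evaluator ratings

rho, p_value = spearmanr(human_scores, llm_judge_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3g})")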

Downloads

  • Slides
  • Paper
  • Transcript English (automatic)


