EMNLP 2025

November 05, 2025

Suzhou, China


When leveraging large language models (LLMs) for multiple-choice question answering (MCQA), no convention exists regarding how the space following the last colon should be tokenized. We highlight this as an example of unclean LLM evaluation that complicates both NLP research and model use in an interdisciplinary context: we observe accuracy differences of up to 6% due to this (seemingly irrelevant) tokenization variation, as well as reshuffled model rankings, a practical concern for leaderboards reliant on this prompt style. Surprisingly, we are able to recommend one specific strategy, tokenizing the space together with the answer letter, as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
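The two tokenization strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not code from the paper: the function name and prompt format are hypothetical, and a real evaluation would pass the resulting prompt/continuation pair to a tokenizer and score the continuation's log-probability.

```python
# Two ways to handle the space after "Answer:" when scoring an MCQA option.
# The full prompt text is identical either way; what differs is whether the
# space is tokenized with the answer letter (e.g. " B" as one BPE token) or
# appended to the prompt, leaving the bare letter "B" to be scored alone.

def split_prompt(question: str, letter: str, space_with_letter: bool):
    """Return a (prompt, continuation) pair for log-prob scoring of `letter`."""
    base = f"{question}\nAnswer:"
    if space_with_letter:
        # Strategy the abstract recommends: space tokenized with the letter.
        return base, f" {letter}"
    # Alternative: space belongs to the prompt; the bare letter is scored.
    return base + " ", letter

q = "Q: What is 2+2? (A) 3 (B) 4"
p1, c1 = split_prompt(q, "B", space_with_letter=True)
p2, c2 = split_prompt(q, "B", space_with_letter=False)

# Both variants concatenate to the same string, yet tokenize differently,
# which is exactly the (seemingly irrelevant) variation the paper measures.
assert p1 + c1 == p2 + c2
print((c1, c2))
```

The key point the sketch makes concrete: both variants produce the same surface string, so the discrepancy is invisible in the prompt text and only appears at the token level.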

