When leveraging large language models (LLMs) for multiple-choice question answering (MCQA), no convention exists regarding how the space following the last colon (e.g., after "Answer:") should be tokenized. We highlight this as an example of unclean LLM evaluation that complicates both NLP research and model use in interdisciplinary contexts: we observe accuracy differences of up to 6% and reshuffled model rankings due to this seemingly irrelevant tokenization variation, a practical concern for leaderboards that rely on this prompt style. Surprisingly, we are able to recommend one specific strategy: tokenizing the space together with the answer letter, which yields consistent and statistically significant performance improvements and additionally improves model calibration. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
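To make the ambiguity concrete, the sketch below contrasts the two tokenization strategies using the Hugging Face transformers tokenizer API. The gpt2 tokenizer, the example prompt, and the variable names are illustrative assumptions, not the paper's actual evaluation setup.

```python
# Minimal sketch of the two tokenization strategies for MCQA scoring.
# Assumes a Hugging Face tokenizer; "gpt2" and the prompt are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Question: Which planet is largest?\nA. Mars\nB. Jupiter\nAnswer:"

# Strategy 1: attach the space to the prompt and tokenize the answer
# letter on its own.
ids_space_with_prompt = (
    tokenizer.encode(prompt + " ")
    + tokenizer.encode("B", add_special_tokens=False)
)

# Strategy 2 (the recommended one): tokenize the space together with
# the answer letter.
ids_space_with_letter = (
    tokenizer.encode(prompt)
    + tokenizer.encode(" B", add_special_tokens=False)
)

# The same visible text can map to different token id sequences, so a
# model scored on the two variants sees different inputs.
print(ids_space_with_prompt)
print(ids_space_with_letter)
print(ids_space_with_prompt == ids_space_with_letter)  # typically False
```

The mechanism is that BPE tokenizers such as GPT-2's typically merge a leading space into the following word, so " B" is a single token while a standalone space token followed by "B" is a different sequence; the two strategies therefore condition the model on different token histories even though the rendered prompt is identical.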