As machine translation systems approach human-level quality, traditional evaluation methodologies struggle to detect subtle translation errors. We critically examine limitations in current gold-standard approaches (MQM and ESA), including excessive categorization complexity, coarse severity granularity, significant bias towards accuracy at the expense of fluency, and overly strict annotation time constraints. Through in-depth analysis of English-Russian translations from WMT24, we demonstrate that employing highly qualified professional translators without strict time limitations produces substantially different results from standard evaluations. We propose the RATE (Refined Assessment for Translation Evaluation) framework and collect high-quality annotations with streamlined error categorization, expanded severity ratings, and multidimensional scoring that balances accuracy and fluency assessments. Our analysis reveals that state-of-the-art MT systems may have surpassed human translations in accuracy while still lagging in fluency, a critical distinction obscured by existing accuracy-biased metrics. Our findings indicate that improving evaluation depth and expertise may be as critical to advancing the field as developing better translation systems.
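To make the idea of multidimensional scoring concrete, the sketch below shows how error annotations could be aggregated into separate accuracy and fluency penalties plus a combined score. The category names, severity weights, per-100-words normalization, and equal dimension weighting are illustrative assumptions, not the scoring formula used in the paper.

```python
# Hypothetical sketch of multidimensional scoring in the spirit of RATE.
# Severity weights, category lists, and the equal accuracy/fluency weighting
# are assumptions for illustration only.

from dataclasses import dataclass

# Assumed finer-grained severity scale than MQM's minor/major/critical.
SEVERITY_WEIGHTS = {1: 0.5, 2: 1.0, 3: 2.0, 4: 5.0}  # assumption

# Assumed mapping of a streamlined error taxonomy onto two dimensions.
ACCURACY_CATEGORIES = {"mistranslation", "omission", "addition"}
FLUENCY_CATEGORIES = {"grammar", "style", "terminology_consistency"}


@dataclass
class ErrorAnnotation:
    category: str
    severity: int  # 1 (slight) .. 4 (critical), per the assumed scale


def rate_scores(errors: list[ErrorAnnotation], num_words: int) -> dict[str, float]:
    """Return per-dimension penalties normalized per 100 source words."""
    accuracy_penalty = sum(
        SEVERITY_WEIGHTS[e.severity] for e in errors if e.category in ACCURACY_CATEGORIES
    )
    fluency_penalty = sum(
        SEVERITY_WEIGHTS[e.severity] for e in errors if e.category in FLUENCY_CATEGORIES
    )
    scale = 100.0 / max(num_words, 1)
    return {
        "accuracy": accuracy_penalty * scale,
        "fluency": fluency_penalty * scale,
        # Equal weighting of the two dimensions is an assumption.
        "combined": (accuracy_penalty + fluency_penalty) * scale / 2,
    }


if __name__ == "__main__":
    annotations = [
        ErrorAnnotation("mistranslation", 3),
        ErrorAnnotation("grammar", 2),
        ErrorAnnotation("style", 1),
    ]
    print(rate_scores(annotations, num_words=120))
```

Reporting accuracy and fluency separately, rather than folding them into a single penalty, is what lets an evaluation surface the pattern described above: systems that match or exceed human accuracy while still trailing in fluency.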