Machine translation quality has steadily improved over the years, with some recent benchmarks indicating that machine translation models produce near-perfect translations. Such error-free outputs are not useful for distinguishing between models or for assessing whether there is still room for improvement in the field. Being able to automatically create difficult test sets holds promise for developing more discriminative evaluations. Unfortunately, reliable methods for automatically estimating translation difficulty do not yet exist, and no previous research has conducted a broad investigation into which approaches are the most effective. In this work, we formalize the task of translation difficulty estimation, defining the difficulty of a text by the quality of its translations. We evaluate baseline and novel methods both intrinsically, with a dedicated evaluation measure, and extrinsically, as a tool for constructing challenging machine translation benchmarks. Our experiments demonstrate that dedicated models vastly outperform both heuristic-based methods, such as word rarity and syntactic complexity, and LLM-as-a-Judge approaches. Practically, given a large collection of source texts, our difficulty estimators can select examples on which machine translation models underperform.
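The formalization above defines a text's difficulty via the quality of its translations. As a minimal illustrative sketch, not the paper's actual method, one could take quality scores (e.g., from an automatic metric, normalized to [0, 1]) that several MT systems achieve on a source text, and treat difficulty as the complement of their average; the function names and score values below are hypothetical:

```python
# Hypothetical sketch: difficulty of a source text as the complement of the
# average translation quality achieved by a pool of MT systems.
# Scores are assumed to lie in [0, 1]; values below are illustrative only.

def difficulty(quality_scores):
    """Return 1 minus the mean of the given per-system quality scores."""
    if not quality_scores:
        raise ValueError("need at least one quality score")
    return 1.0 - sum(quality_scores) / len(quality_scores)

def hardest(texts_with_scores, k):
    """Select the k source texts with the highest estimated difficulty."""
    ranked = sorted(texts_with_scores, key=lambda ts: difficulty(ts[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Example: three systems score each of two source texts.
pool = [("sentence A", [0.92, 0.85, 0.78]),
        ("sentence B", [0.55, 0.60, 0.50])]
selected = hardest(pool, 1)  # picks the lower-quality (harder) text
```

Under this toy aggregation, selecting high-difficulty examples from a large collection of source texts corresponds to the benchmark-construction use case described in the abstract.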