
We present IMO-Bench, a suite of advanced reasoning benchmarks designed for robust evaluation and targeted at the level of the International Mathematical Olympiad (IMO), the most prestigious venue for competitive mathematics. IMO-Bench consists of diverse and challenging problems vetted by a panel of top IMO medalists and mathematicians. The first benchmark, IMO-AnswerBench, consists of 400 problems with verifiable answers, curated from past Olympiad competitions and then altered by experts for robustness in evaluation. The latest frontier models struggle on this benchmark, with final-answer accuracy below 48%. To advance the field beyond simple short-answer evaluation, we design IMO-ProofBench, consisting of both basic and novel problems, with detailed grading guidelines for full proof evaluation. Expert grading reveals that the best model scores less than 36% of the maximum on this benchmark. To reduce grading cost, we share an automatic grader for the basic set whose scores correlate highly with human expert evaluations. Last but not least, we construct IMO-MistakeBench, a benchmark for identifying the first incorrect step in a full solution. We hope that, together, these benchmarks help advance robust mathematical reasoning.