EMNLP 2025

November 05, 2025

Suzhou, China


Competitive programming problems, due to their high reasoning difficulty and precise correctness feedback, have become a key benchmark for evaluating the reasoning capabilities of large language models (LLMs), playing a pivotal role in both LLM evaluation and reinforcement learning (RL) training. However, while existing public datasets gather problems from platforms such as Codeforces and attempt to generate additional test cases, these generated test cases often fall short of the official ones in quality, resulting in inaccurate evaluations. In this paper, we introduce an agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new dataset with improved test cases, CodeContests+. We evaluate the evaluation accuracy of both datasets using 1.72 million real-world submissions. Results show that CodeContests+ achieves significantly higher evaluation accuracy than CodeContests and yields better performance in RL training.
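The abstract measures a test suite's quality by how well its verdicts agree with official judge verdicts on real submissions. The sketch below illustrates one plausible way to compute such an agreement rate; it is not the authors' code, and the names (Submission, run_test, evaluation_accuracy) and the simple accept/reject agreement metric are assumptions for illustration only.

```python
# Hedged sketch (not the authors' implementation): estimate a test suite's
# evaluation accuracy from real-world submissions whose official verdicts
# are known. A submission is "accepted" by a suite only if it passes every
# test case; accuracy is the fraction of submissions whose suite verdict
# matches the official one.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Submission:
    code: str
    official_accepted: bool  # ground-truth verdict from the judge platform


def suite_verdict(run_test: Callable[[str, str], bool],
                  code: str,
                  test_inputs: List[str]) -> bool:
    """True iff the submission passes all test cases in the suite."""
    return all(run_test(code, t) for t in test_inputs)


def evaluation_accuracy(submissions: List[Submission],
                        test_inputs: List[str],
                        run_test: Callable[[str, str], bool]) -> float:
    """Fraction of submissions whose suite verdict agrees with the official verdict."""
    agree = sum(
        suite_verdict(run_test, s.code, test_inputs) == s.official_accepted
        for s in submissions
    )
    return agree / len(submissions)
```

In practice, run_test would wrap a sandboxed execution of the submission against a test input and compare its output to the reference answer; the paper's actual metric may be more fine-grained (e.g., tracking true-positive and false-positive rates separately).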

