Competitive programming problems, with their high reasoning difficulty and precise correctness feedback, have become a key benchmark for evaluating the reasoning capabilities of large language models (LLMs), playing a pivotal role in both LLM evaluation and reinforcement learning (RL) training. However, while existing public datasets gather problems from platforms such as Codeforces and generate additional test cases in various ways, these test cases often fall short of the official ones in quality, leading to inaccurate evaluations. In this paper, we introduce an agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new dataset with improved test cases, CodeContests+. We evaluate the accuracy of both datasets using 1.72 million real-world submissions. The results show that CodeContests+ achieves significantly higher evaluation accuracy than CodeContests and yields better performance in RL training.
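To make the notion of evaluation accuracy concrete, the sketch below measures how often a dataset's test cases reproduce the official judge's verdict on a set of real submissions. This is a minimal illustration, not the authors' code; the `Submission` record and the `judge` callable are hypothetical names used only for this example.

```python
# Minimal sketch: evaluation accuracy as agreement between a dataset's
# test-case verdicts and the official judge's verdicts on real submissions.
# All names here (Submission, judge) are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Submission:
    problem_id: str
    source_code: str
    official_accepted: bool  # verdict given by the official judge


def evaluation_accuracy(
    submissions: Iterable[Submission],
    judge: Callable[[str, str], bool],  # (problem_id, source_code) -> accepted?
) -> float:
    """Fraction of submissions on which the dataset's tests agree with the official verdict."""
    total = 0
    agree = 0
    for sub in submissions:
        total += 1
        if judge(sub.problem_id, sub.source_code) == sub.official_accepted:
            agree += 1
    return agree / total if total else 0.0
```

Under this framing, a dataset with weak or missing test cases tends to accept incorrect submissions (false positives), which is what drives its evaluation accuracy down relative to official test suites.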