Large reasoning models (LRMs) improve performance at test time by \emph{thinking longer}, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt mechanical outcome-level rewards (e.g., rule- or prompt-based) that favor shorter correct paths but overlook the quality of intermediate reasoning. In contrast, dense supervision from process reward models (PRMs) has proven more effective at promoting coherent, high-quality reasoning. However, static PRM supervision introduces two challenges: \textit{reward hacking}, since fixed rewards poorly capture global reasoning objectives, and \textit{high training cost}, since dense reward labels are expensive to obtain at scale. To overcome these issues, we propose step group relative policy optimization (\textbf{Step-GRPO}), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while enhancing reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train LLMs and achieve consistent gains in reasoning quality and accuracy alongside shorter reasoning traces across multiple math benchmarks, outperforming RL baselines at substantially lower cost. Notably, the proposed model achieves 36.7\% accuracy on AIME'24 with 11K samples and \$38 of training cost, surpassing \$1000+ baselines trained on 40K+ samples, demonstrating strong cost-effectiveness and scalability.
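The abstract does not give the exact formulation, but the core idea can be sketched as follows: step-level PRM scores are aggregated into a single trajectory-level reward via attention-style weights over steps, and those rewards are then standardized within each sampled group, as in GRPO. The function names, the softmax form of the step-attention, and all numbers below are illustrative assumptions, not the paper's definitions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def trajectory_reward(step_scores, attn_logits):
    """Aggregate step-level PRM scores into one sparse trajectory-level
    reward, weighting each step by a (hypothetical) step-attention
    distribution that emphasizes critical reasoning steps."""
    weights = softmax(attn_logits)
    return sum(w * s for w, s in zip(weights, step_scores))

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each trajectory's reward
    against the mean and std of its own sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mu) / std for r in rewards]

# Example: a group of 3 sampled trajectories for one prompt.
# Each pair is (per-step PRM scores, per-step attention logits).
groups = [
    ([0.9, 0.8, 0.95], [2.0, 0.5, 1.0]),
    ([0.4, 0.3, 0.5],  [1.0, 1.0, 1.0]),
    ([0.7, 0.9, 0.2],  [0.2, 1.5, 0.1]),
]
rewards = [trajectory_reward(s, a) for s, a in groups]
advantages = group_relative_advantages(rewards)
```

Because the PRM scores are folded into a single scalar per trajectory before the group-relative step, the policy update only ever needs trajectory-level feedback, which is what lets the method avoid dense step-level supervision during training.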
