Low sampling efficiency during the rollout phase poses a significant challenge to scaling reinforcement learning (RL) for large language model reasoning. Existing methods attempt to improve efficiency by scheduling problems according to their difficulty. However, these approaches rely on unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To address these limitations, we introduce $\textbf{C}$ompetence-$\textbf{D}$ifficulty $\textbf{A}$lignment $\textbf{S}$ampling ($\textbf{CDAS}$). $\textbf{CDAS}$ obtains accurate and stable estimates of problem difficulty by aggregating historical performance discrepancies across problems. It then quantifies model competence and, using a fixed-point system, adaptively selects problems whose difficulty aligns with the model's current competence. Extensive experiments in mathematical RL training show that $\textbf{CDAS}$ consistently outperforms strong baselines, achieving the highest average accuracy of 45.89\%. Furthermore, $\textbf{CDAS}$ reduces the training step time overhead by 57.06\% compared to the widely used Dynamic Sampling strategy, confirming its efficiency. Additional experiments on different tasks, model architectures, and model sizes demonstrate the generalization capability of $\textbf{CDAS}$.
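The sampling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so every name here (`AlignmentSampler`, `ProblemState`, `select`, `update`) is hypothetical, and the specific choices are assumptions made for illustration. In particular, the sketch assumes that a performance discrepancy is the gap between a sigmoid-predicted pass rate and the observed pass rate, that difficulty is the running mean of those gaps, and that competence is the negative mean difficulty. It performs one update per training step rather than solving a fixed-point system.

```python
import math
from dataclasses import dataclass, field


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class ProblemState:
    # Running sum of (predicted - observed) pass-rate gaps, and how many were seen.
    gap_sum: float = 0.0
    visits: int = 0

    @property
    def difficulty(self) -> float:
        # Assumption: difficulty is the mean historical discrepancy (0 if unseen).
        return self.gap_sum / self.visits if self.visits else 0.0


@dataclass
class AlignmentSampler:
    problems: dict = field(default_factory=dict)  # problem id -> ProblemState
    competence: float = 0.0

    def add_problem(self, pid: str) -> None:
        self.problems[pid] = ProblemState()

    def select(self, k: int) -> list:
        # Pick the k problems whose difficulty is closest to the current
        # competence; ties are broken by problem id for determinism.
        ranked = sorted(
            self.problems,
            key=lambda pid: (abs(self.competence - self.problems[pid].difficulty), pid),
        )
        return ranked[:k]

    def update(self, pass_rates: dict) -> None:
        # pass_rates maps problem id -> observed fraction of correct rollouts.
        for pid, observed in pass_rates.items():
            state = self.problems[pid]
            predicted = sigmoid(self.competence - state.difficulty)
            state.gap_sum += predicted - observed
            state.visits += 1
        # Assumption: competence is the negative mean difficulty over all problems.
        total = sum(s.difficulty for s in self.problems.values())
        self.competence = -total / len(self.problems)
```

In a training loop, one would call `select(k)` to choose the next batch, run rollouts to measure each problem's pass rate, and then pass those rates to `update`. Because difficulty is averaged over a problem's whole history, a single noisy rollout shifts the estimate less than a per-step pass rate would, which is the stability property the abstract emphasizes.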