Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller large language models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning interleaves two processes: meta-reasoning, which selects the appropriate sub-problem from multiple candidates, and solving, which addresses the selected sub-problem. Authentic reasoning therefore has an implicit multi-branch structure. SFT collapses this rich structure into flat token-by-token prediction over the teacher's reasoning path, so the structure is not distilled into the student. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). GSRM converts a reasoning path into a sequence of meta-reasoning-solving steps and computes a reward that measures how well the student's reasoning structure aligns with the teacher's. By combining this reward with RL, RLKD enables the student LLM to internalize the teacher's implicit multi-branch structure rather than merely mimicking the teacher's fixed output paths. Experiments show that RLKD, even when trained on only 0.1% of the data under an RL-only regime, surpasses standard SFT-RL pipelines and unlocks more of the student LLM's latent reasoning ability than SFT-based distillation. Code is in the supplemental material and will be released after review.
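To make the idea of a structure-alignment reward concrete, the following is a minimal illustrative sketch, not the authors' GSRM. It assumes each reasoning path has already been parsed into ordered (meta-reasoning, solving) pairs, and it substitutes a simple string-similarity score for the generative reward model. The names `Step`, `text_similarity`, `structure_reward`, and `meta_weight` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Step:
    """One hypothetical meta-reasoning-solving step of a reasoning path."""
    meta: str   # which sub-problem was selected (meta-reasoning)
    solve: str  # how that sub-problem was addressed (solving)


def text_similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; an assumed stand-in for a learned judge."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structure_reward(student, teacher, meta_weight=0.5):
    """Average step-by-step alignment between two paths, in [0, 1].

    Steps are compared position by position. Dividing by the longer path's
    length penalises a student path that is shorter or longer than the teacher's.
    """
    if not student or not teacher:
        return 0.0
    total = 0.0
    for s, t in zip(student, teacher):
        total += meta_weight * text_similarity(s.meta, t.meta)
        total += (1.0 - meta_weight) * text_similarity(s.solve, t.solve)
    return total / max(len(student), len(teacher))


if __name__ == "__main__":
    teacher_path = [
        Step("isolate the x term", "subtract 3 from both sides"),
        Step("solve for x", "divide both sides by 2"),
    ]
    student_path = [Step("isolate the x term", "subtract 3 from both sides")]
    print(structure_reward(student_path, teacher_path))
```

Under these assumptions, the scalar returned by `structure_reward` plays the role of the reward signal in an RL loop: a student path that picks the same sub-problems in the same order scores higher than one that only reproduces similar surface text in a different order.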