
Fine-tuning pretrained large language models (LLMs) lies at the core of modern AI applications. Recent advances in fine-tuning methods, such as reinforcement learning (RL), have led to substantial improvements. However, multiple studies have shown that fine-tuning often degrades model safety, even in models explicitly trained for safety. In particular, LLMs fine-tuned for reasoning consistently exhibit increased safety risks, raising concerns about their deployment. In this work, we demonstrate that reinforcement learning with verifiable rewards (RLVR), a method often combined with supervised fine-tuning (SFT), can maintain safety guardrails without compromising reasoning performance. Our empirical evaluations provide quantitative evidence supporting this claim across diverse models and settings. Additionally, we present a theoretical framework that formalizes the safety-preserving properties of RLVR, offering deeper insight.
