
AAAI 2026

July 06, 2026

Singapore, Singapore


Fine-tuning pretrained large language models (LLMs) lies at the core of modern AI applications. Recent advances in fine-tuning methods, such as reinforcement learning (RL), have led to substantial improvements. However, multiple studies have shown that fine-tuning often degrades model safety, even in models explicitly trained for safety. In particular, LLMs fine-tuned for reasoning consistently exhibit increased safety risks, raising concerns about their deployment. In this work, we demonstrate that reinforcement learning with verifiable rewards (RLVR), a method often combined with supervised fine-tuning (SFT), can maintain safety guardrails without compromising reasoning performance. Our empirical evaluations provide quantitative evidence for this claim across diverse models and settings. Additionally, we present a theoretical framework that formalizes the safety-preserving properties of RLVR, offering deeper insight into this behavior.
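In RLVR, the reward comes from a programmatic check against a known correct answer rather than from a learned reward model. The abstract does not say which verifier the authors use, so the snippet below is only a minimal sketch of one common form, a binary exact-match reward for numeric answers. The helper names `extract_final_answer` and `verifiable_reward`, and the "last number in the completion is the answer" convention, are hypothetical assumptions made for illustration.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in a completion, or None if there is none.

    Assumption (hypothetical convention): the model's final answer is the
    last number it writes.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer equals the
    reference numerically, otherwise 0.0 (also 0.0 if nothing parses)."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if float(answer) == float(reference) else 0.0
    except ValueError:
        # Non-numeric reference: this sketch only verifies numeric answers.
        return 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 3 = 5, so the answer is 5", "5"))
    print(verifiable_reward("The answer is 7", "5"))
```

Because the reward depends only on whether the final answer checks out, it is cheap to compute and hard to game with fluent but wrong text. The safety properties the abstract claims are established by the paper's own experiments and theory, not by this snippet.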
