Direct Preference Optimization (DPO) typically relies on a fixed inverse-temperature β that controls divergence from a reference model. A fixed β is brittle: too small a value under-regularizes (verbosity, safety drift), while too large a value over-regularizes (underfitting). I propose an adaptive per-token KL controller that uses EMA smoothing, deadband filtering, and clipping to adjust β dynamically throughout training. Initial results on a 7B model show a 72% win rate against the base SFT model and a 60% win rate against fixed-β DPO. The goal is a practical recipe for stable, compute-efficient DPO with reduced manual tuning.
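To make the controller idea concrete, here is a minimal sketch of how EMA smoothing, a deadband, and clipping could combine to adjust β. It is an illustration under stated assumptions, not the implementation behind the reported results: the class name `AdaptiveBetaController`, the multiplicative update in log-β space, and every default value (target KL, decay, deadband width, gain, β bounds) are hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveBetaController:
    """Hypothetical controller that steers beta toward a per-token KL target."""

    beta: float = 0.1        # initial beta (assumed value)
    target_kl: float = 0.05  # desired per-token KL to the reference (assumed)
    ema_decay: float = 0.9   # weight on the previous EMA value
    deadband: float = 0.1    # relative error tolerated without any update
    gain: float = 0.1        # step size in log-beta space
    beta_min: float = 0.01   # lower clip for beta
    beta_max: float = 1.0    # upper clip for beta
    kl_ema: Optional[float] = None

    def update(self, per_token_kl: float) -> float:
        # EMA smoothing: damp batch-to-batch noise in the measured KL.
        if self.kl_ema is None:
            self.kl_ema = per_token_kl
        else:
            self.kl_ema = (
                self.ema_decay * self.kl_ema
                + (1.0 - self.ema_decay) * per_token_kl
            )

        # Relative error between the smoothed KL and the target.
        error = (self.kl_ema - self.target_kl) / self.target_kl

        # Deadband: leave beta alone while the KL is close to the target.
        if abs(error) > self.deadband:
            # Clip the error so a single outlier cannot cause a large jump.
            error = max(-1.0, min(1.0, error))
            # KL above target raises beta (more regularization);
            # KL below target lowers it (less regularization).
            self.beta *= math.exp(self.gain * error)

        # Clip beta to a safe operating range.
        self.beta = max(self.beta_min, min(self.beta_max, self.beta))
        return self.beta
```

In a training loop, one would call `update` once per step with the batch-mean per-token KL between the policy and the reference model, then pass the returned β into the DPO loss for the next step. The multiplicative update keeps β positive and makes the step size proportional to its current value, which is one reasonable design choice among several.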