Post-training quantization is a widely adopted technique for compressing large language models (LLMs), enabling efficient deployment in resource-constrained environments. However, recent studies have revealed that quantization, especially aggressive methods such as 4-bit QLoRA and Straight-Through Estimators (STE), can significantly degrade a model's safety alignment, increasing its susceptibility to harmful prompt completions and jailbreak behaviors. This research investigates the safety risks introduced by quantization and proposes a novel mitigation strategy: projecting quantized parameters back into safety-aligned subspaces. Building on prior work such as SafeLoRA, the study first empirically evaluates safety degradation across benchmark datasets (PureBad, Dialog Summary, Alpaca) using metrics such as Harmfulness Score, Attack Success Rate (ASR), and StrongReject Score. The second phase explores projection-based restoration techniques to recover alignment-preserving directions in parameter space. Finally, the effectiveness of these interventions is assessed through end-to-end evaluations. By addressing the overlooked safety implications of model compression, this research contributes toward the development of robust, ethically aligned LLMs suitable for real-world deployment.
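
To make the projection idea concrete, the following is a minimal sketch of a SafeLoRA-style restoration step, under the assumption that the alignment subspace for each layer is estimated from the difference between a safety-aligned checkpoint and its unaligned base, and that the quantization "damage" is projected onto that subspace only when it has drifted away from it. The function names, the cosine-similarity threshold, and the per-layer formulation are illustrative assumptions, not the authors' exact procedure.

```python
import torch


def alignment_projector(w_aligned: torch.Tensor,
                        w_unaligned: torch.Tensor) -> torch.Tensor:
    """Build a per-layer projection matrix onto the alignment direction.

    Following the SafeLoRA idea, the alignment subspace is estimated from the
    difference between a safety-aligned checkpoint and its unaligned base model.
    """
    v = w_aligned - w_unaligned                       # alignment direction, shape (d_out, d_in)
    return (v @ v.T) / torch.linalg.matrix_norm(v)    # C = V V^T / ||V||_F


def restore_quantized_layer(w_aligned: torch.Tensor,
                            w_quantized: torch.Tensor,
                            proj: torch.Tensor,
                            sim_threshold: float = 0.35) -> torch.Tensor:
    """Project the quantization-induced weight change back toward the alignment subspace.

    w_quantized is the dequantized weight matrix of one layer; the difference
    from the aligned weights is treated as the update to be corrected. Layers
    whose change already points along the alignment direction (high cosine
    similarity) are left untouched; the threshold is a hypothetical value.
    """
    delta_w = w_quantized - w_aligned
    projected = proj @ delta_w
    cos = torch.nn.functional.cosine_similarity(
        projected.flatten(), delta_w.flatten(), dim=0)
    if cos < sim_threshold:
        return w_aligned + projected   # keep only the alignment-preserving component
    return w_quantized                 # change already compatible with alignment
```

In this sketch the projection is applied layer by layer after dequantization; whether to re-quantize the restored weights, and how to choose the threshold, are among the design questions the second phase of the study would need to evaluate.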
