Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.
“Refusals must be resilient, not brittle.” Yet guarding refusals against adversarial phrasing and shifting user contexts remains difficult: large language models (LLMs) still yield to jailbreak prompts that evade safety filters and surface harmful content. Despite gains from methods like reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT), these global controls blur refusal boundaries across domains such as violence, fraud, and privacy, and frequently collapse under adversarial variation. We propose Refusal Activation Steering (RAS), a training-free, inference-time method that uses contrastive activations to shift LLM responses, biasing generation trajectories toward refusals without altering model weights. The approach is modular and domain-targetable, avoiding collateral refusals on benign queries while strengthening activation-space boundaries for unsafe content. On adversarial evaluations with an 8B instruction-tuned model, we find that steering improves refusal rate by 52% and reduces attack success rate by 40%, establishing a lightweight and interpretable safety layer for robust refusal consistency.