AAAI 2026

January 23, 2026

Singapore, Singapore


“Refusals must be resilient, not brittle.” Yet guarding refusals against adversarial phrasing and shifting user contexts remains difficult: large language models (LLMs) still yield to jailbreak prompts that evade safety filters and surface harmful content. Despite gains from methods like reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT), these global controls blur refusal boundaries across domains such as violence, fraud, and privacy, and frequently collapse under adversarial variation. We propose Refusal Activation Steering (RAS), a training-free, inference-time method that uses contrastive activations to shift LLM responses, biasing generation trajectories toward refusals without altering model weights. The approach is modular and domain-targetable, avoiding collateral refusals on benign queries while strengthening activation-space boundaries for unsafe content. On adversarial evaluations with an 8B instruction-tuned model, we find that steering improves refusal rate by 52% and reduces attack success rate by 40%, establishing a lightweight and interpretable safety layer for robust refusal consistency.
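The abstract's core mechanism is contrastive activation steering: a steering vector is derived as the mean difference between hidden activations on refusal-eliciting and compliance-eliciting prompts, then added (scaled) to the model's hidden states at inference time. The sketch below illustrates that arithmetic on toy activations; the helper names, array shapes, and scaling factor `alpha` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def steering_vector(refusal_acts: np.ndarray, comply_acts: np.ndarray) -> np.ndarray:
    """Contrastive steering vector: mean activation on refusal prompts
    minus mean activation on compliant prompts (hypothetical helper)."""
    return refusal_acts.mean(axis=0) - comply_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Bias each token's hidden state toward the refusal direction,
    without touching any model weights."""
    return hidden + alpha * v

# Toy setup: 8 contrastive examples, hidden size 16 (real models use
# transformer residual-stream activations at a chosen layer).
rng = np.random.default_rng(0)
refusal_acts = rng.normal(1.0, 0.1, (8, 16))    # activations on refusal prompts
comply_acts = rng.normal(-1.0, 0.1, (8, 16))    # activations on compliant prompts
v = steering_vector(refusal_acts, comply_acts)

hidden = rng.normal(0.0, 1.0, (4, 16))          # 4 tokens' hidden states
steered = apply_steering(hidden, v, alpha=0.5)

# Steering adds alpha * ||v||^2 > 0 to each token's projection onto v,
# so the steered states align more with the refusal direction.
print(steered.shape)
print(bool((steered @ v).mean() > (hidden @ v).mean()))
```

In a real deployment this addition would be applied inside the model via a forward hook at a selected layer during generation; because the vector is computed once and applied only at inference, the method stays training-free and can be restricted to specific unsafe domains by choosing the contrastive prompt set.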
