AAAI 2026

January 24, 2026

Singapore, Singapore

Large language models remain systematically vulnerable to adversarial attacks despite extensive safety alignment through supervised fine-tuning and reinforcement learning from human feedback. These vulnerabilities manifest as differential safety behavior across token positions: safety modifications concentrate in early positions, while later positions show minimal distributional change from the base model. We provide a mechanistic analysis of safety alignment training dynamics, revealing that gradient concentration during autoregressive training causes the training signal to decay across token positions. The result is incomplete distributional learning, in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens as computational indicators of incomplete safety learning. Our analysis shows that, while early positions undergo substantial distributional changes, later positions retain base-model preferences in safety-critical contexts, indicating systematic incomplete learning caused by insufficient training signal. We develop a targeted completion method that addresses these undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates improved adversarial robustness, with reduced attack success rates across multiple attack types while preserving general capabilities.
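
The position-wise picture in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' method: the abstract does not define base-favored tokens precisely, so the flagging rule used here (a position is flagged when the aligned model's most likely token is still the base model's most likely token) is an assumption made for illustration. The function names and the hand-written probability tables are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def argmax(dist):
    """Index of the most likely token in a distribution."""
    return max(range(len(dist)), key=lambda i: dist[i])


def base_favored_positions(base_dists, aligned_dists):
    """Positions where the aligned model's top token matches the base model's.

    ASSUMPTION: this rule is a stand-in for the paper's definition of
    base-favored tokens, which the abstract does not spell out.
    """
    return [
        t
        for t, (b, a) in enumerate(zip(base_dists, aligned_dists))
        if argmax(a) == argmax(b)
    ]


# Toy next-token distributions over a 3-token vocabulary at 4 response
# positions. Token 0 stands for the base model's preferred continuation.
base_dists = [[0.7, 0.2, 0.1]] * 4
aligned_dists = [
    [0.10, 0.20, 0.70],  # early position: strongly shifted away from base
    [0.30, 0.30, 0.40],  # still shifted, but less so
    [0.60, 0.25, 0.15],  # later position: close to base
    [0.69, 0.21, 0.10],  # last position: nearly identical to base
]

# Distributional change from the base model, measured per position.
per_position_kl = [kl_divergence(a, b) for a, b in zip(aligned_dists, base_dists)]
flagged = base_favored_positions(base_dists, aligned_dists)

for t, kl in enumerate(per_position_kl):
    marker = "base-favored" if t in flagged else "shifted"
    print(f"position {t}: KL(aligned || base) = {kl:.4f}  [{marker}]")
```

In this toy data the divergence from the base model shrinks with position and only the later positions are flagged, which mirrors the qualitative pattern the abstract describes. With real models, the distributions would come from the base and aligned models' next-token probabilities on the same prefix.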
