EMNLP 2025

November 06, 2025

Suzhou, China


Safety alignment is critical for pre-trained large language models (LLMs) to generate responses aligned with human values and to refuse harmful queries. Unlike LLMs, the safety alignment of vision-language models (VLMs) is currently often achieved with post-hoc safety fine-tuning, and these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly accounts for the adversary by integrating adversarial training into Direct Preference Optimization (DPO), enhancing the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. Together, these innovations ensure that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that ADPO outperforms baselines in both the safety alignment and the general utility of VLMs.
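The abstract describes ADPO only at a high level. As an illustration of the general idea, here is a minimal PyTorch sketch of one adversary-aware DPO training step, assuming an L-infinity PGD inner loop over the image input; the `policy.logp` and `ref_model.logp` helpers (returning per-example response log-probabilities) are hypothetical, and the paper's exact perturbation model and reference-model training may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO objective on (winner, loser) response log-probabilities.
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def adversary_aware_dpo_step(policy, ref_model, image, prompt, y_w, y_l,
                             eps=8 / 255, step=2 / 255, pgd_iters=3, beta=0.1):
    """One training step (illustrative sketch): an inner PGD loop finds a
    worst-case image perturbation that maximizes the DPO loss, then the
    outer step evaluates the DPO loss under that perturbation so it can be
    minimized w.r.t. the policy parameters."""
    delta = torch.zeros_like(image)
    for _ in range(pgd_iters):
        delta.requires_grad_(True)
        adv = (image + delta).clamp(0.0, 1.0)
        loss = dpo_loss(policy.logp(adv, prompt, y_w),
                        policy.logp(adv, prompt, y_l),
                        ref_model.logp(adv, prompt, y_w).detach(),
                        ref_model.logp(adv, prompt, y_l).detach(),
                        beta)
        # Adversary ascends the loss; project back into the eps-ball.
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-eps, eps).detach()
    adv = (image + delta).clamp(0.0, 1.0)
    return dpo_loss(policy.logp(adv, prompt, y_w),
                    policy.logp(adv, prompt, y_l),
                    ref_model.logp(adv, prompt, y_w).detach(),
                    ref_model.logp(adv, prompt, y_l).detach(),
                    beta)
```

The returned loss would then be backpropagated through the policy only, with the reference model held fixed, mirroring standard DPO training.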
