Lecture image placeholder

Premium content

Access to this content requires a subscription. You must be a premium user to view this content.

Monthly subscription - $9.99Pay per view - $4.99Access through your institutionLogin with Underline account
Need help?
Contact us
Lecture placeholder background
VIDEO DOI: https://doi.org/10.48448/qcm7-8d32

poster

ACL 2024

August 12, 2024

Bangkok, Thailand

Navigating the OverKill in Large Language Models

keywords:

large language models; safety; overkill

Large language models are meticulously aligned to be both helpful and harmless. However, recent research points to a potential overkill which means models may refuse to answer benign queries. In this paper, we investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to excessive attention to harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. Based on these insights, we introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon. We first extract such excessive attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. Then we determine the final next-token predictions by downplaying the excessive attention via contrastive decoding. Empirical results have indicated that our method has achieved an average reduction of the refusal rate by 20 % while having almost no impact on safety.

Downloads

SlidesTranscript English (automatic)

Next from ACL 2024

Spectral Filters, Dark Signals, and Attention Sinks
poster

Spectral Filters, Dark Signals, and Attention Sinks

ACL 2024

Nicola Cancedda

12 August 2024

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Lectures
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2023 Underline - All rights reserved