The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper systematically evaluates and enhances LLMs' capability to perform course-correction, i.e., the ability of a model to steer away from generating harmful content autonomously. First, we introduce the C$^2$-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C$^2$-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it improves LLMs' safety, particularly in resisting jailbreak attacks.

Course-Correction: Safety Alignment Using Synthetic Preferences

Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines the Competitive Index and the post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, and the results validate the effectiveness of our approach.
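The blending step mentioned in the abstract above can be pictured with a small sketch. The abstract does not give the actual AED formula, so everything here is an assumption made for illustration: the function name `refine_logits`, the treatment of the Competitive Index as a scalar weight clipped to [0, 1], and the simple convex combination of the two logit vectors are all hypothetical and may differ from the paper's method.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_logits(original, post_alignment, competitive_index):
    """Blend two logit vectors token by token.

    ASSUMPTION: the competitive index is used as a weight in [0, 1].
    A weight near 0 keeps the original logits; a weight near 1 moves
    toward the post-alignment logits. This is an illustrative stand-in,
    not the formula from the paper.
    """
    w = min(max(competitive_index, 0.0), 1.0)
    return [(1.0 - w) * o + w * p for o, p in zip(original, post_alignment)]


# Toy three-token vocabulary (hypothetical values).
original = [2.0, 0.0, -1.0]        # favors token 0
post_alignment = [-1.0, 0.0, 3.0]  # favors token 2
for index in (0.0, 0.5, 1.0):
    probs = softmax(refine_logits(original, post_alignment, index))
    print(index, [round(p, 3) for p in probs])
```

Under this assumed scheme, a higher index shifts probability mass toward the tokens favored by the post-alignment logits, which is one plausible reading of "adaptively combines" in the abstract.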

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreaks can circumvent safety guardrails, causing LLMs to generate harmful content and raising concerns about LLM safety. Because language models with massive parameter counts are often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can distinguish malicious from normal inputs in the early layers. Alignment then associates these early concepts with emotion guesses in the middle layers and refines them into the specific rejection tokens for safe generation. Jailbreak disturbs the transformation of the early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across various model families to support our conclusion. Overall, our paper reveals the intrinsic mechanism of LLM safety and how jailbreaks circumvent safety guardrails, offering a new perspective on LLM safety that may help reduce such concerns.
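The idea of reading safety-related signals from hidden states with a weak classifier can be sketched as a linear probe. This is a minimal illustration under stated assumptions, not the paper's setup: the "hidden states" are synthetic 4-dimensional vectors generated by the hypothetical helper `make_toy_states`, the label 1 stands in for "malicious input", and the probe is a plain logistic regression trained by gradient descent. A real experiment would extract per-layer hidden states from an actual model.

```python
import math
import random


def make_toy_states(n_per_class, dim, rng):
    """Synthetic stand-ins for hidden states (ASSUMPTION, not real model data).

    Label 1 ("malicious") is shifted by +1 along axis 0, label 0 by -1.
    """
    data = []
    for label in (0, 1):
        center = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            vec = [rng.gauss(0.0, 0.3) for _ in range(dim)]
            vec[0] += center
            data.append((vec, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w, b, n = [0.0] * dim, 0.0, len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(data, w, b):
    """Fraction of examples whose predicted label matches the true label."""
    hits = 0
    for x, y in data:
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(predicted == y)
    return hits / len(data)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
train_set = make_toy_states(40, 4, rng)
held_out = make_toy_states(40, 4, rng)
w, b = train_probe(train_set, dim=4)
print(f"held-out probe accuracy: {accuracy(held_out, w, b):.2f}")
```

If a probe this simple separates the two classes at a given layer, that is commonly taken as evidence that the distinction is already linearly encoded there, which is the kind of argument the abstract makes about early layers.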

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Main conference posters, Findings, Industry track and Demos are being presented during this session.

Virtual Poster Session 2


## Welcome to EMNLP 2024! 
We are excited to welcome you to one of the most prominent conferences in the field of Natural Language Processing. This year, EMNLP 2024 is being held in a hybrid format,
offering both virtual and in-person participation in beautiful Miami. Due to a record-breaking number of submissions, we've expanded the total number of accepted papers to accommodate more cutting-edge research from around the globe.
### [Conference Handbook](https://drive.google.com/file/d/1WPROgxjLAC96AJL7Ugy0tEnYm7dkrbHt/view?usp=sharing)

You are required to register for this event. **Please register [here](https://2024.emnlp.org/registration/).** The EMNLP 2024 event page on Underline will be open to the public one week prior to the event.


EMNLP 2024

EMNLP 2024 will take place in Miami, Florida, from November 12 to November 16, 2024, at the Hyatt Regency Miami Hotel and on Underline for remote participants.

This poster session includes Main Conference posters and Findings from the following areas:

Language Modeling • Ethics, Bias, and Fairness • Discourse and Pragmatics • Multilinguality and Language Diversity • Phonology, Morphology, and Word Segmentation • Syntax: Tagging, Chunking and Parsing

Zhenhong Zhou


Presentations

Course-Correction: Safety Alignment Using Synthetic Preferences

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
