China

Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks.

EMNLP 2025

Understanding Refusal in Language Models with Sparse Autoencoders

refusal

saes

mechanistic interpretability

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining the safety relying heavily on general standards, overlooking user-specific standards. However, safety standards for LLM may vary based on a user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SafeBench, a benchmark designed to assess user-specific aspect of LLM safety. Our evaluation of 20 widely used LLMs reveals current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety.

Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model’s activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work presents that each gradient-based jailbreak attack and subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at https://anonymous.4open.science/r/CRI_for_paper-7BB5.

Jailbreak Attack Initializations as Extractors of Compliance Directions

The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass, where the former is length-independent and related to trustworthiness such as correctness, toxicity, and consistency, and the latter is length-dependent and represents the amount of information in the response. We empirically demonstrated the decomposition through controlled experiments and found that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses under equivalent length intervals.

Explaining Length Bias in LLM-Based Preference Evaluations

Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs’ internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose **SafeSwitch**, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models' capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls.

SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals

Hit identification is a central challenge in early drug discovery, traditionally requiring substantial experimental resources. Recent advances in artificial intelligence, particularly large language models (LLMs), have enabled virtual screening methods that reduce costs and improve efficiency. However, the growing complexity of these tools has limited their accessibility to wet-lab researchers. Multi-agent systems offer a promising solution by combining the interpretability of LLMs with the precision of specialized models and tools. In this work, we present MADD, a multi-agent system that builds and executes customized hit identification pipelines from natural language queries. MADD employs four coordinated agents to handle key subtasks in de novo compound generation and screening. We evaluate MADD across seven drug discovery cases and demonstrate its superior performance compared to existing LLM-based solutions. Using MADD, we pioneer application of AI-first drug design to five biological targets and release the identified hit molecules. Finally, we introduce a new benchmark of query-molecule pairs and docking scores for over three million compounds to contribute to the agentic future of drug design.

MADD: Multi-Agent Drug Discovery Orchestra

Financial documents such as 10-K filings pose significant retrieval challenges due to their length, formal structure, and domain-specific language—features often underutilized by standard retrieval-augmented generation (RAG) models. We present FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval, a retrieval framework tailored for financial document analysis. FinGEAR introduces a modular architecture that combines lexicon-guided filtering, dual-hierarchy indexing (via a Summary Tree and Question Tree), and cross-encoder reranking. This structure-aware design enables fine-grained retrieval aligned with financial discourse. Extensive evaluations show that FinGEAR significantly outperforms state-of-the-art RAG baselines across multiple retrieval metrics. By explicitly modeling document semantics and structure, FinGEAR improves retrieval fidelity and enhances downstream task performance, offering a principled solution for high-stakes financial information access.

FinGEAR: Financial Mapping-Guided Enhanced Answer Retrieval

With the continuous development of language models and the widespread availability of various types of accessible interfaces, large language models (LLMs) have been applied to an increasing number of fields. However, due to the vast amounts of data and computational resources required for model development, protecting the model's parameters and training data has become an urgent and crucial concern. Due to the revolutionary training and application paradigms of LLMs, many new attacks on language models have emerged in recent years. In this paper, we define these attacks as ``reverse engineering'' (RE) techniques on LMs and aim to provide an in-depth analysis of reverse engineering of language models. We illustrate various methods of reverse engineering applied to different aspects of a model, while also providing an introduction to existing protective strategies. On the one hand, it demonstrates the vulnerabilities of even black box models to different types of attacks; on the other hand, it offers a more holistic perspective for the development of new protective strategies for models.

Towards Reverse Engineering of Language Models: A Survey

Multi-digit addition is a clear probe of the computational power of large language models. To dissect the internal arithmetic processes in LLaMA-3-8B-Instruct, we combine linear probing with logit-lens inspection. Inspired by the step-by-step manner in which humans perform addition, we propose and analyze a coherent four-stage trajectory in the forward pass: Formula-structure representations become linearly decodable first, while the answer token is still far down the candidate list. Core computational features then emerge prominently. At deeper activation layers, numerical abstractions of the result become clearer, enabling near-perfect detection and decoding of the individual digits in the sum. Near the output, the model organizes and generates the final content, with the correct token reliably occupying the top rank. This trajectory suggests a hierarchical process that favors internal computation over rote memorization. We release our code and data to facilitate reproducibility.

Addition in Four Movements: Mapping Layer-wise Information Trajectories in LLMs

Recent studies have highlighted the remarkable knowledge retention capabilities of Large Language Models (LLMs) like GPT-4, while simultaneously revealing critical limitations in maintaining knowledge currency and accuracy. Existing knowledge editing methodologies, designed to update specific factual information without compromising general model performance, often encounter two fundamental challenges: parameter conflict during knowledge overwriting and excessive computational overhead. In this paper, we introduce ForGet (Forget for Get), a novel approach grounded in the principle of "forgetting before learning". By pinpointing the location within the LLM that corresponds to the target knowledge, we first erase the outdated knowledge and then insert the new knowledge at this precise spot. ForGet is the first work to leverage a two-phase gradient-based process for knowledge editing, offering a lightweight solution that also delivers superior results. Experimental findings show that our method achieves more effective knowledge editing at a lower cost compared to previous techniques across various base models.

Forget for Get: A Lightweight Two-phase Gradient Method for Knowledge Editing in Large Language Models

Large Language Models (LLMs) are trained on diverse and often conflicting knowledge spanning multiple domains and time periods. Some of this knowledge is only valid within specific temporal contexts, such as answering the question, ``Who is the President of the United States in 2022?'' Ensuring LLMs generate time-appropriate responses is crucial for maintaining relevance and accuracy. In this work we explore activation engineering as a method for temporally aligning LLMs to improve factual recall without any training. Activation engineering has predominantly been used to steer subjective and qualitative outcomes such as toxicity or behavior. Our research is one of few that uncovers the bounds of activation engineering on objective outcomes. We explore an activation engineering technique to anchor LLaMA 2, LLaMA 3.1, Qwen 2 and Gemma 2 to specific points in time and examine the effects of varying injection layers and prompting strategies. Our experiments demonstrate up to a 44\% and 16\% improvement in relative and explicit prompting respectively, achieving comparable performance to the fine-tuning method proposed by Zhao et al. (2024). Notably, for LLaMA 2 and LLaMA 3.1 our approach achieves similar results to the fine-tuning baseline while being significantly more computationally efficient and requiring no pre-aligned datasets. Both datasets and code will be released publicly upon paper acceptance.

Downloads

Next from EMNLP 2025

Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES