China

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.
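As a rough illustration of the basic SAE architecture the abstract refers to, the sketch below encodes one activation vector into non-negative features with a ReLU encoder, reconstructs it with a linear decoder, and scores it with a reconstruction error plus an L1 sparsity penalty. This is a minimal, standard-library sketch under assumed conventions, not the survey's formulation: the function names, the dimensions, the random weights, and the `l1_coeff` value are all hypothetical.

```python
import random

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into sparse features f, then reconstruct x_hat from f."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on the features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return recon + l1_coeff * sparsity

# Hypothetical sizes: a 4-dim activation expanded into 16 features.
random.seed(0)
d_model, d_sae = 4, 16
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_model)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(len(f), len(x_hat), sae_loss(x, f, x_hat))
```

In practice the weights would be trained on real LLM activations; the random initialization here only demonstrates the shapes and the loss terms.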

EMNLP 2025

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

sparse autoencoder

mechanistic interpretability

survey

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

With the development of Large Language Models (LLMs), numerous efforts have revealed their vulnerabilities to jailbreak attacks. Although these studies have driven progress in LLMs' safety alignment, it remains unclear whether LLMs have internalized authentic knowledge to deal with real-world crimes, or are merely forced to simulate toxic language patterns. This ambiguity raises concerns that jailbreak success is often attributable to a hallucination loop between the jailbroken LLM and the judge LLM. By decoupling the use of jailbreak techniques, we construct knowledge-intensive Q&A to investigate the misuse threats of LLMs in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness. Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and show that existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns. Our study reveals a gap between existing LLM safety assessments and real-world threat potential.

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

This paper proposes the Complete Textual Concept Bottleneck Model (CT-CBM), a novel TCBM generator that builds concept labels in a fully unsupervised manner using a small language model, eliminating the need for both predefined human-labeled concepts and LLM annotations. CT-CBM iteratively targets and adds important and identifiable concepts in the bottleneck layer to create a complete concept basis. CT-CBM achieves striking results against competitors in terms of concept basis completeness and concept detection accuracy, offering a promising solution to reliably enhance the interpretability of NLP classifiers.

Towards Achieving Concept Completeness for Textual Concept Bottleneck Models

Large language models (LLMs) are trained using massive datasets. However, these datasets often contain undesirable content, e.g., harmful texts, personal information, and copyrighted material. To address this, *machine unlearning* aims to remove information from trained models. Recent work has shown that soft token attacks (STAs) can successfully extract unlearned information from LLMs. In this work, we show that STAs can be an inadequate tool for auditing unlearning. Using common unlearning benchmarks (*Who Is Harry Potter?* and *TOFU*), we demonstrate that, in a *strong auditor* setting, such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm, and (2) whether the queried content was originally present in the training corpus. We also show that STAs with just a few soft tokens (1-10) can elicit random strings over 400 characters long. This shows that STAs must be used carefully to effectively audit unlearning.

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Oversensitivity—where language models defensively reject benign prompts—not only disrupts user interactions but also obscures the boundaries between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model’s unique behavior. Building on this approach, we construct OverBench, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 26 models. OverBench provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.

Dynamic Evaluation for Oversensitivity in LLMs

Currently, large language model (**L**LM)-based **O**pen-domain **N**atural language plannin**G** (LONG) has considerable room for improvement. For example, non-reusable plans with incomplete intermediate states and missing steps hinder real-world applications. To remedy these flaws, this paper establishes a dataset with a baseline for LONG. The dataset, GOLD, is the largest dataset to date of textual procedures with corresponding reusable formal planning domain definitions. The baseline, DIGGER, leverages entity-attribute-level action models, which reveal relevant implicit physical properties (aka attributes) of salient entities in actions. DIGGER first extracts action models and builds typed entity lists from textual procedures. Then, it builds goal states for new tasks and instantiates grounded actions using domain prediction. Finally, plans are generalized and translated into textual procedures by an LLM. Reference-based metrics, LLM-as-Judge, and human evaluation are employed to evaluate LONG comprehensively. Experiments on GOLD validate that DIGGER is stronger and more generalizable than recently proposed approaches and LLMs: DIGGER is the best in seen domains and applicable to unseen domains without adaptation. Specifically, the best BLEU-1 score increased from 0.385 to 0.408 on seen domains and boosted to 0.310 on unseen domains.

LLM-based Open Domain Planning by Leveraging Entity-Attribute-Level Domain Models

The cross-prompt trait scoring task aims to learn generalizable scoring capabilities from source-prompt data, enabling automatic scoring across multiple dimensions on unseen essays. Existing research on cross-prompt trait essay scoring primarily focuses on improving model generalization by obtaining prompt-invariant representations. In this paper, we approach the research problem from a different perspective on invariance learning and propose a scoring-invariant learning objective. This objective encourages the model to focus on intrinsic information within the essay that reflects its quality during training, thereby learning generic scoring features. To further enhance the model's ability to score across multiple dimensions, we introduce a trait feature extraction network based on routing gates into the scoring architecture and propose a trait consistency scoring objective to encourage the model to balance the diversity of trait-specific features with scoring consistency across traits when learning trait-specific essay features. Extensive experiments demonstrate the effectiveness of our approach, showing advantages in multi-trait scoring performance and achieving significant improvements with low-resource prompts.

Improving Prompt Generalization for Cross-prompt Essay Trait Scoring from the Scoring-invariance Perspective

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. However, they remain vulnerable to semantic inconsistency, where minor, semantically equivalent variations in input formatting result in divergent predictions. Our comprehensive evaluation reveals that this brittleness persists even in state-of-the-art models such as GPT-4o, posing a serious challenge to their reliability. Through a mechanistic analysis, we attribute this phenomenon to deep representational failures, whereby semantically equivalent input changes induce instability in the model’s internal representations. We further examine standard mitigation strategies and uncover their fundamental limitations. In particular, even direct fine-tuning on format variations frequently fails to yield format-invariant semantic representations, highlighting the difficulty of the problem. By explaining the failure of existing methods through our representational diagnosis, we underscore the need for representation-aware strategies to achieve robust and reliable LLM behavior.

When Format Changes Meaning: Investigating Semantic Inconsistency of Large Language Models

Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an efficient frontier between ASR and perplexity, highlighting perplexity as a previously under-considered factor in red-teaming.
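The two quantities this abstract trades off can be computed as in the sketch below, which derives perplexity from per-token log-probabilities and attack success rate from per-attempt judgments. It is a generic illustration under assumed inputs, not ASTPrompter's implementation: the function names and the example log-probabilities are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def attack_success_rate(judged_unsafe):
    """Fraction of attack attempts judged successful (list of booleans)."""
    return sum(judged_unsafe) / len(judged_unsafe)

# Hypothetical per-token log-probs for a fluent and a garbled prompt.
fluent = [math.log(0.5)] * 4
garbled = [math.log(0.01)] * 4

print(perplexity(fluent) < perplexity(garbled))  # fluent text scores lower
print(attack_success_rate([True, False, True, True]))
```

A lower perplexity means the frozen LLM finds the prompt more likely, which is why such prompts are harder to filter with a perplexity threshold.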

ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts

The absence of explicit communication channels between automated vehicles (AVs) and other road users requires the use of external Human-Machine Interfaces (eHMIs) to convey messages effectively in uncertain scenarios. Currently, most eHMI studies employ predefined text messages and manually designed actions to perform these messages, which limits the real-world deployment of eHMIs, where adaptability in dynamic scenarios is essential. Given the generalizability and versatility of large language models (LLMs), they could potentially serve as automated action designers for the message-action design task. To validate this idea, we make three contributions: (1) We propose a pipeline that integrates LLMs and 3D renderers, using LLMs as action designers to generate executable actions for controlling eHMIs and rendering action clips. (2) We collect a user-rated Action-Design Scoring dataset comprising a total of 320 action sequences for eight intended messages and four representative eHMI modalities. The dataset validates that LLMs can translate intended messages into actions close to a human level, particularly for reasoning-enabled LLMs. (3) We introduce two automated raters, Action Reference Score (ARS) and Vision-Language Models (VLMs), to benchmark 18 LLMs, finding that the VLM aligns with human preferences yet varies across eHMI modalities. The source code, prompts, Blender scenarios, and rendered clips are available at https://anonymous.4open.science/r/eHMI_action_design/.

Automating eHMI Action Design with LLMs for Automated Vehicle Communication

Large Language Model (LLM)-based self-refinement methods have significantly enhanced data analysis performance, especially in correcting errors for Text-to-SQL. However, their effectiveness diminishes when addressing SQL semantic errors, since LLM hallucinations cause persistent biases in the semantic understanding of questions, leading to an uncorrectable situation. To solve this problem, we propose Test-driven Self-refinement for Text-to-SQL (TS-SQL). It leverages a collaborative LLM agent framework to automatically synthesize high-quality test cases, including test data and test code. The test cases are further employed to provide execution feedback for LLM self-refinement towards SQL semantic errors. Rigorous evaluation shows the superiority of TS-SQL: for BIRD-dev, TS-SQL improves at least 6% over existing SQL self-refinement methods; for Spider-dev, TS-SQL identifies and corrects 131 gold SQL errors, exposing system flaws in benchmark rigor. For reproducibility, we release the modified Spider-dev benchmark to foster further research.
