Large language models (LLMs) are trained on massive datasets. However, these datasets often contain undesirable content, e.g., harmful text, personal information, and copyrighted material. To address this, \emph{machine unlearning} aims to remove such information from trained models. Recent work has shown that soft token attacks (\sta{s}) can successfully extract unlearned information from LLMs. In this work, we show that \sta{s} can be an inadequate tool for auditing unlearning. Using common unlearning benchmarks (\textit{Who Is Harry Potter?} and \textit{TOFU}), we demonstrate that, in a \emph{strong auditor} setting, such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm and (2) whether the queried content was originally present in the training corpus. Moreover, we show that an \sta with just a few soft tokens (1-10) can elicit random strings over 400 characters long. These results show that \sta{s} must be used carefully to audit unlearning effectively.
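To illustrate the kind of attack the abstract refers to, below is a minimal sketch of a soft token attack: a handful of continuous "soft tokens" prepended to the input are optimized by gradient descent so that a frozen causal LM emits a chosen target string. This is not the paper's actual attack code; the model name, target string, and hyperparameters are placeholder assumptions.

```python
# Minimal sketch of a soft token attack (STA): optimize a few continuous
# "soft tokens" prepended to the input so that a frozen LLM emits a chosen
# target string. Model name, target, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper audits unlearned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():        # the model stays frozen; only the
    p.requires_grad_(False)         # soft tokens are optimized

target = "text the auditor tries to elicit"                     # e.g., 'unlearned' content
target_ids = tok(target, return_tensors="pt").input_ids         # (1, T)
target_emb = model.get_input_embeddings()(target_ids).detach()  # (1, T, D)

n_soft = 5                                            # 1-10 soft tokens, per the abstract
emb_dim = model.get_input_embeddings().embedding_dim
soft = torch.randn(1, n_soft, emb_dim, requires_grad=True)      # trainable soft tokens
opt = torch.optim.Adam([soft], lr=1e-2)

# The loss is computed only on the target positions; soft-token positions
# are masked out with the ignore index -100.
labels = torch.cat(
    [torch.full((1, n_soft), -100, dtype=torch.long), target_ids], dim=1
)

for step in range(500):
    inputs_embeds = torch.cat([soft, target_emb], dim=1)  # (1, n_soft + T, D)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```

A loss near zero means the optimized soft prompt drives the frozen model to reproduce the target almost verbatim. This is why, under a strong auditor, a successful elicitation by itself is weak evidence about whether the model actually retained the queried content.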