China

Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model&#39;s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model&#39;s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

EMNLP 2025

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

persona-driven behavior

mechanistic interpretability

reasoning

Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model's reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model's output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Prompt tuning for Large Language Models (LLMs) is vulnerable to backdoor attacks. The mainstream methods achieve backdoor attacks through prompt tuning on rich training data. However, in real-world data-limited scenarios, these methods have difficulty capturing precise backdoor patterns, leading to weakened backdoor attack capabilities and significant side effects for the LLMs. To alleviate this problem, we propose a enhanced backdoor attacks through contrastive-enhanced machine unlearning in data-limited scenarios, called BCU. Specifically, BCU introduces a multi-objective machine unlearning method to capture precise backdoor patterns by forgetting the association between non-trigger data and the backdoor patterns, reducing side effects. Moreover, we design a contrastive learning strategy to enhance the capturing ability of backdoor patterns, achieving powerful backdoor attacks in data-limited scenarios. Experimental results on 6 NLP datasets and 4 LLMs show that BCU exhibits strong backdoor attack capabilities and slight side effects, whether the training data is rich or limited. Our findings highlight the more practical security risks of backdoor attacks against LLMs, necessitating further research for security purposes.

Forget the Unneeded: Backdooring Large Language Models via Contrastive-enhanced Machine Unlearning

Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., "une sentinelle" - grammatically feminine in French but referring to the stereotypically masculine concept "guard"). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.

Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

Automated fact-checking aims to assess the truthfulness of textual claims based on relevant evidence. However, verifying complex claims that require multi-hop reasoning remains a significant challenge. We propose **GraphCheck**, a novel framework that transforms claims into entity-relationship graphs for structured and systematic verification. By explicitly modeling both explicit and latent entities and exploring multiple reasoning paths, GraphCheck improves verification robustness. While GraphCheck excels in complex scenarios, it may be unnecessarily elaborate for simpler claims. To address this, we introduce **DP-GraphCheck**, an adaptive variant that employs a lightweight strategy selector to dynamically choose between direct prompting and GraphCheck. This selective mechanism improves both accuracy and efficiency by applying the appropriate level of reasoning per claim. Experiments on the HOVER and EX-FEVER datasets demonstrate that our approach outperforms existing methods, particularly on multi-hop claims. Moreover, the strategy selection mechanism in DP-GraphCheck generalizes well to other fact-checking pipelines, highlighting the versatility of our framework.

GraphCheck: Multipath Fact-Checking with Entity-Relationship Graphs

Political bias in Large Language Models (LLMs) presents a growing concern for the responsible deployment of AI systems. Traditional audits often attempt to locate a model’s political position as a point estimate, masking the broader set of ideological boundaries that shape what a model is willing or unwilling to say. In this paper, we draw upon the concept of the Overton Window as a framework for mapping these boundaries: the range of political views that a given LLM will espouse, remain neutral on, or refuse to endorse. To uncover these windows, we apply an inquiry-based methodology that probes LLMs through task-driven prompts designed to elicit political stances indirectly. Using the Political Compass Test, we evaluate twenty-six LLMs from eight providers to reveal their distinct Overton Windows. While many models default to economically left and socially liberal positions, we show that their willingness to express or reject certain positions varies considerably, where DeepSeek models tend to be very restrictive in what they will discuss and Gemini models to be most expansive. Our findings demonstrate that Overton Windows offer a richer, more nuanced view of political bias in LLMs and provide a new lens for auditing their normative boundaries.

POW: Political Overton Windows of Large Language Models

Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments. However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermine reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.

Reliability Crisis of Reference-free Metrics for Grammatical Error Correction

Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.

Jailbreak Distillation: Renewable Safety Benchmarking

Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. In this work, we propose a calibrated log-likelihood-based selection framework to improve multi-LLM performance. Our approach leverages uncertainty estimation to identify the most confident response while minimizing inference costs. We show that our method outperforms majority voting and exceeds self-consistency performance when using a large number of model calls. Through extensive experiments, we demonstrate improvements of approx. 4%, 3%, and 5% on GSM8K, MMLU, and ARC, respectively, when applying uncertainty-aware selection to multi-LLM systems.

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Detectors of LLM-generated text suffer from poor domain shifts generalization ability. Yet, reliable text detection methods in the wild are of paramount importance for plagiarism detection, integrity of the public discourse, and AI safety. Linguistic and domain confounders introduce spurious correlations, leading to poor out-of-distribution (OOD) performance. In this work we introduce the concept of confounding neurons, individual neurons within transformers-based detectors that encode dataset-specific biases rather than task-specific signals. Leveraging confounding neurons, we propose a novel post-hoc, neuron-level intervention framework to disentangle AI-generated text detection factors from data-specific biases. Through extensive experiments we prove its ability to effectively reduce topic-specific biases, enhancing the model’s ability to generalize across domains.

How to Generalize the Detection of AI-Generated Text: Confounding Neurons

Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmless, and summary quality. Our method obtains +10\% and +3\% win-rate points in summarization and dialogue scenarios, respectively, without compromising the reasoning capabilities of the model, or the relevancy and faithfulness of the summary response.

SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs' performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs' understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.

Downloads

Next from EMNLP 2025

Forget the Unneeded: Backdooring Large Language Models via Contrastive-enhanced Machine Unlearning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES