China

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines.

EMNLP 2025

Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification

explanation faithfulness

uncertainty

calibration

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific layers. Experiments show that this circuit is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens

Misgendering is the act of referring to someone by a gender that does not match their chosen identity. It marginalizes and undermines a person's sense of self, causing significant harm. English-based approaches have clear-cut approaches to avoiding misgendering, such as the use of the pronoun ``they''. However, other languages pose unique challenges due to both grammatical and cultural constructs. In this work we develop methodologies to assess and mitigate misgendering across 42 languages and dialects using a participatory-design approach to design effective and appropriate guardrails across all languages. We test these guardrails in a standard LLM-based application (meeting transcript summarization), where both the data generation and the annotation steps followed a human-in-the-loop approach. We find that the proposed guardrails are very effective in reducing misgendering rates across all languages in the summaries generated, and without incurring loss of quality. Our human-in-the-loop approach demonstrates a method to feasibly scale inclusive and responsible AI-based solutions across multiple languages and cultures. We release the guardrails and synthetic dataset encompassing 42 languages, along with human and LLM-judge evaluations, to encourage further research on this subject.

A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications

In this work, we compile textbf\texttt{DroidCollection}, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop textbf\texttt{DroidDetect}, a suite of encoder-only detectors trained using a multi-task objective over textttDroidCollection. Our experiments show that existing detectors' performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.

textttDroid: A Resource Suite for AI-Generated Code Detection

The Mixture of Experts (MoE) architecture improves large language models (LLMs) by utilizing sparsely activated expert sub-networks with a routing module, but it typically demands high training cost. Previous work introduces parameter-efficient fine-tuning (PEFT) modules, e.g., LoRA, to achieve a lightweight MoE for training efficiency. However, they construct static experts by manually splitting the LoRA parameters into fixed groups, which limits flexibility and dynamism. Furthermore, this manual partitioning also hinders the effective utilization of well-initialized LoRA modules. To address the challenges, we first delve into the parameter patterns in LoRA modules, revealing that there exists task-relevant parameters that are concentrated along the rank dimension of the LoRA parameters. Based on this, we redesign the construction of experts and propose the method LoRACoE (LoRA Composition of Experts). Specifically, when confronted with a task, it dynamically builds experts based on rank-level parameter composition, i.e., experts can flexibly combine rank-level parameters in LoRA module. Extensive experiments demonstrate that compared to other LoRA-based MoE methods, our method achieves better task performance across a broader range of tasks.

LoRACoE: Improving Large Language Model via Composition-based LoRA Expert

Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce DDOT (Discrete Diffusion with Optimal Transport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.

Flexible-length Text Infilling for Discrete Diffusion Models

As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain textitwhy one model outperforms another. In this work, we use textbfmodel diffing, a mechanistic interpretability approach, to analyze the specific capability differences between textbfGemma-2-9b-it and a textbfSimPO-enhanced variant. Using textbfcrosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO acquired latent concepts predominantly enhance safety mechanisms (+32.8\%), multilingual capabilities (+43.8\%), and instruction-following (+151.7\%), while its additional training also reduces emphasis on model self-reference (-44.1\%) and hallucination management (-68.5\%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.

Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing

We introduce the first dataset that includes both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification methods for this language. The LCP annotations were collected from young adult participants using different corpora, including 569 human-translated samples from English, 1,765 samples from the Representative Corpus of Romanian, and 1,587 samples from a diverse set of Romanian texts. We propose a methodology for ordering simplification candidates using pairwise ranking approximation and explore several novel pipelines for complexity prediction and simplification. These efforts result in the development of the first text simplification system for Romanian.

RALS: Resources and Baselines for Romanian Automatic Lexical Simplification

The widespread adoption of large language models (LLMs) has increased the need for reliable AI-text detection. While current detectors perform well on benchmark datasets, we identify a critical vulnerability: increasing the temperature parameter during inference significantly reduces detection accuracy. Based on this weakness, we propose TempParaphraser, a simple yet effective paraphrasing framework that simulates high-temperature sampling effects through multiple normal-temperature generations, effectively evading detection. Experiments show that TempParaphraser reduces detector accuracy by an average of 97.3% while preserving high text quality. We also demonstrate that training on TempParaphraser-augmented data improves detector robustness. All resources are publicly available to support future research.

TempParaphraser: "Heating Up" Text to Evade AI-Text Detection through Paraphrasing

Recent works in Natural Language Inference (NLI) and related tasks employ atomic fact decomposition to enhance interpretability and robustness, yet existing methods rely on resource-intensive large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we introduce SYRP, a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in-distribution and significantly improves robustness to shallow heuristic biases compared to models based purely on extractive rationale supervision. Our findings show that fine-grained interpretability and robust generalization in NLI can be efficiently realized using encoder-only architectures and synthetic rationales.

Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass

Efficient resume parsing is critical for global hiring, yet the absence of dedicated benchmarks for evaluating large language models (LLMs) on multilingual, structure-rich resumes hinders progress. To address this, we introduce ResumeBench, the first privacy-compliant benchmark comprising 2,500 synthetic resumes spanning 50 templates, 30 career fields, and 5 languages. These resumes are generated through a human-in-the-loop pipeline that prioritizes realism, diversity, and privacy compliance, which are validated against real-world resumes. This paper evaluates 24 state-of-the-art LLMs on ResumeBench, revealing substantial variations in handling resume complexities. Specifically, top-performing models like GPT-4o exhibit challenges in cross-lingual structural alignment while smaller models show inconsistent scaling effects. Code-specialized LLMs underperform relative to generalists, while JSON outputs enhance schema compliance but fail to address semantic ambiguities. Our findings underscore the necessity for domain-specific optimization and hybrid training strategies to enhance structural and contextual reasoning in LLMs.

Downloads

Next from EMNLP 2025

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES