China

In this work, we compile **`DroidCollection`**, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop **`DroidDetect`**, a suite of encoder-only detectors trained using a multi-task objective over `DroidCollection`. Our experiments show that existing detectors' performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.

EMNLP 2025

`Droid`: A Resource Suite for AI-Generated Code Detection

llms. code-lms

ai-generated content detection


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The Mixture of Experts (MoE) architecture improves large language models (LLMs) by utilizing sparsely activated expert sub-networks with a routing module, but it typically demands high training cost. Previous work introduces parameter-efficient fine-tuning (PEFT) modules, e.g., LoRA, to achieve a lightweight MoE for training efficiency. However, these methods construct static experts by manually splitting the LoRA parameters into fixed groups, which limits flexibility and dynamism. Furthermore, this manual partitioning also hinders the effective utilization of well-initialized LoRA modules. To address these challenges, we first delve into the parameter patterns in LoRA modules, revealing that there exist task-relevant parameters that are concentrated along the rank dimension of the LoRA parameters. Based on this, we redesign the construction of experts and propose the method LoRACoE (LoRA Composition of Experts). Specifically, when confronted with a task, it dynamically builds experts based on rank-level parameter composition, i.e., experts can flexibly combine rank-level parameters in the LoRA module. Extensive experiments demonstrate that compared to other LoRA-based MoE methods, our method achieves better task performance across a broader range of tasks.

LoRACoE: Improving Large Language Model via Composition-based LoRA Expert
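To make "rank-level parameter composition" concrete: a LoRA update is a product of two low-rank factors, so it can always be written as a sum of rank-1 terms, one per rank index. The sketch below is only an illustration of that identity, not the LoRACoE implementation. The function names, the column/row layout of the factors, and the idea of an "expert" as a weight vector over rank indices are assumptions made for this example.

```python
# Illustrative sketch only: names and the mask-based "expert" are assumptions,
# not the method described in the abstract above.

def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def compose_lora_delta(b_cols, a_rows, rank_weights=None):
    """Build a LoRA weight update as a weighted sum of rank-1 terms.

    b_cols[r] is column r of the d x R factor B, a_rows[r] is row r of the
    R x k factor A. With all weights equal to 1 this equals the product B @ A.
    """
    if len(b_cols) != len(a_rows):
        raise ValueError("B and A must share the same rank dimension")
    if rank_weights is None:
        rank_weights = [1.0] * len(b_cols)
    d, k = len(b_cols[0]), len(a_rows[0])
    delta = [[0.0] * k for _ in range(d)]
    for weight, b_col, a_row in zip(rank_weights, b_cols, a_rows):
        term = outer(b_col, a_row)  # rank-1 contribution of this rank index
        for i in range(d):
            for j in range(k):
                delta[i][j] += weight * term[i][j]
    return delta


# Hypothetical rank-2 LoRA factors for a 2 x 3 weight matrix.
B_COLS = [[1.0, 0.0], [0.0, 2.0]]
A_ROWS = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

full_update = compose_lora_delta(B_COLS, A_ROWS)
expert_a = compose_lora_delta(B_COLS, A_ROWS, rank_weights=[1.0, 0.0])
expert_b = compose_lora_delta(B_COLS, A_ROWS, rank_weights=[0.0, 1.0])
```

Because the update is linear in the rank weights, two experts that select disjoint rank indices add up to the full LoRA update, which is why selecting ranks per task does not require splitting the parameters into fixed groups in advance.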

Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce DDOT (Discrete Diffusion with Optimal Transport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.

Flexible-length Text Infilling for Discrete Diffusion Models

As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain *why* one model outperforms another. In this work, we use **model diffing**, a mechanistic interpretability approach, to analyze the specific capability differences between **Gemma-2-9b-it** and a **SimPO**-enhanced variant. Using **crosscoders**, we identify and categorize latent representations that differentiate the two models. We find that SimPO acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.

Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing

We introduce the first dataset that includes both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification methods for this language. The LCP annotations were collected from young adult participants using different corpora, including 569 human-translated samples from English, 1,765 samples from the Representative Corpus of Romanian, and 1,587 samples from a diverse set of Romanian texts. We propose a methodology for ordering simplification candidates using pairwise ranking approximation and explore several novel pipelines for complexity prediction and simplification. These efforts result in the development of the first text simplification system for Romanian.

RALS: Resources and Baselines for Romanian Automatic Lexical Simplification

The widespread adoption of large language models (LLMs) has increased the need for reliable AI-text detection. While current detectors perform well on benchmark datasets, we identify a critical vulnerability: increasing the temperature parameter during inference significantly reduces detection accuracy. Based on this weakness, we propose TempParaphraser, a simple yet effective paraphrasing framework that simulates high-temperature sampling effects through multiple normal-temperature generations, effectively evading detection. Experiments show that TempParaphraser reduces detector accuracy by an average of 97.3% while preserving high text quality. We also demonstrate that training on TempParaphraser-augmented data improves detector robustness. All resources are publicly available to support future research.

TempParaphraser: "Heating Up" Text to Evade AI-Text Detection through Paraphrasing
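For readers unfamiliar with the temperature parameter mentioned in the abstract above, the sketch below shows the standard temperature-scaled softmax: logits are divided by the temperature before normalization, so higher temperatures flatten the next-token distribution. This is not the TempParaphraser method itself, and the logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for three candidate tokens.
LOGITS = [2.0, 1.0, 0.0]

normal = softmax_with_temperature(LOGITS, temperature=1.0)
hot = softmax_with_temperature(LOGITS, temperature=2.0)

print("T=1.0:", [round(p, 3) for p in normal], "entropy", round(entropy(normal), 3))
print("T=2.0:", [round(p, 3) for p in hot], "entropy", round(entropy(hot), 3))
```

At the higher temperature the most likely token loses probability mass to the alternatives, so sampled text becomes less predictable.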

Recent works in Natural Language Inference (NLI) and related tasks employ atomic fact decomposition to enhance interpretability and robustness, yet existing methods rely on resource-intensive large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we introduce SYRP, a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in-distribution and significantly improves robustness to shallow heuristic biases compared to models based purely on extractive rationale supervision. Our findings show that fine-grained interpretability and robust generalization in NLI can be efficiently realized using encoder-only architectures and synthetic rationales.

Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass

Efficient resume parsing is critical for global hiring, yet the absence of dedicated benchmarks for evaluating large language models (LLMs) on multilingual, structure-rich resumes hinders progress. To address this, we introduce ResumeBench, the first privacy-compliant benchmark comprising 2,500 synthetic resumes spanning 50 templates, 30 career fields, and 5 languages. These resumes are generated through a human-in-the-loop pipeline that prioritizes realism, diversity, and privacy compliance, which are validated against real-world resumes. This paper evaluates 24 state-of-the-art LLMs on ResumeBench, revealing substantial variations in handling resume complexities. Specifically, top-performing models like GPT-4o exhibit challenges in cross-lingual structural alignment while smaller models show inconsistent scaling effects. Code-specialized LLMs underperform relative to generalists, while JSON outputs enhance schema compliance but fail to address semantic ambiguities. Our findings underscore the necessity for domain-specific optimization and hybrid training strategies to enhance structural and contextual reasoning in LLMs.

Beyond Human Labels: A Multi-Linguistic Auto-Generated Benchmark for Evaluating Large Language Models on Resume Parsing

With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich knowledge and formatting in human written documents. In this paper, we propose a novel **A**utomatic **I**terative **R**efinement (**AIR**) framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs' ability to follow complex instructions. The AIR framework consists of two stages: 1) Generate an initial instruction from a document; 2) Iteratively refine instructions with LLM-as-judge guidance by comparing the model's output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model’s ability to follow complex instructions, outperforming existing methods for instruction generation.

AIR: Complex Instruction Generation via Automatic Iterative Refinement

Two key capabilities of language models (LMs) include encoding prior knowledge about entities, which enables them to answer queries like "What's the official language of Austria?", and adapting to new information provided in-context, e.g., "Pretend the official language of Austria is Tagalog." In this work, we present the family of targeted persuasion scores (TPS), designed to measure how persuasive a context is to an LM. Compared to evaluating persuasiveness based on a model's decoded answer to a query, the TPS family of measures offers a more fine-grained view of model behavior. Based on the Wasserstein distance, the TPS family of measures captures how much a context can shift a model from its original answer distribution toward a target answer distribution and, furthermore, can flexibly adapt to leverage relationships between possible answers for more meaningful measures. Empirically, we demonstrate that analyzing model behavior with the TPS can reveal more subtle aspects of model behavior that would otherwise remain hidden when only observing a model's decoded answer, e.g., in how contradictory in-context information influences a model. Through the TPS, we offer a way to more carefully measure the effect that a context has on a language model.

How Persuasive Is Your Context?
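The abstract above does not give the exact definition of the targeted persuasion scores, so the following is only a sketch of the underlying idea: measure how far an answer distribution is from a target distribution using the Wasserstein distance, before and after a context is added. It assumes answers lie on an ordered one-dimensional scale with unit spacing, which is a simplification chosen for this example. All names and probabilities are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(p, q, positions=None):
    """Wasserstein-1 distance between two distributions on one ordered support.

    For one-dimensional supports the distance equals the area between the
    two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    if positions is None:
        positions = list(range(len(p)))  # assume unit spacing between answers
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    total = 0.0
    for i in range(len(p) - 1):
        gap = positions[i + 1] - positions[i]
        total += abs(cdf_p[i] - cdf_q[i]) * gap
    return total


# Hypothetical answer distributions over three ordered answers.
PRIOR = [0.9, 0.1, 0.0]          # model's distribution without the context
WITH_CONTEXT = [0.2, 0.1, 0.7]   # model's distribution with the context
TARGET = [0.0, 0.0, 1.0]         # all mass on the answer the context argues for

before = wasserstein_1d(PRIOR, TARGET)
after = wasserstein_1d(WITH_CONTEXT, TARGET)
print("distance to target before context:", round(before, 3))
print("distance to target after context:", round(after, 3))
print("reduction in distance:", round(before - after, 3))
```

Unlike checking only the decoded answer, this kind of distance changes even when the top answer stays the same, which matches the abstract's point about finer-grained measurement. For unordered answers a general cost matrix would be needed instead of positions on a line.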

De-identification in the healthcare setting is an application of NLP where automated algorithms are used to remove personally identifying information of patients (and, sometimes, providers). With the recent rise of generative large language models (LLMs), there has been a corresponding rise in the number of papers that apply LLMs to de-identification. Although these approaches often report near-perfect results, significant challenges concerning reproducibility and utility of the research papers persist. This paper identifies three key limitations in the current literature: inconsistent reporting metrics hindering direct comparisons, the inadequacy of traditional classification metrics in capturing errors to which LLMs may be more prone (i.e., altering clinically relevant information), and lack of manual validation of automated metrics which aim to quantify these errors. To address these issues, we first present a survey of LLM-based de-identification research, highlighting the heterogeneity in reporting standards. Second, we evaluate a diverse set of models to quantify the extent of inappropriate removal of clinical information. Next, we conduct a manual validation of an existing evaluation metric to measure the removal of clinical information, employing clinical experts to assess its efficacy. We highlight poor performance and describe the inherent limitations of such metrics in identifying clinically significant changes. Lastly, we propose a novel methodology for the detection of clinically relevant information removal.
