Two key capabilities of language models (LMs) are encoding prior knowledge about entities, which enables them to answer queries like "What's the official language of Austria?", and adapting to new information provided in context, e.g., "Pretend the official language of Austria is Tagalog." In this work, we present the family of targeted persuasion scores (TPS), designed to measure how persuasive a context is to an LM. Compared to evaluating persuasiveness from a model's decoded answer to a query alone, TPS offers a more fine-grained view of model behavior. Built on the Wasserstein distance, TPS captures how much a context can shift a model from its original answer distribution toward a target answer distribution and, furthermore, can flexibly incorporate relationships between possible answers to yield more meaningful measurements. Empirically, we demonstrate that analyzing models with TPS can reveal subtle behaviors that remain hidden when only a model's decoded answer is observed, e.g., how contradictory in-context information influences a model. Through TPS, we offer a way to measure more carefully the effect that a context has on a language model.
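To make the idea concrete, here is a minimal sketch of one plausible member of the TPS family, not the paper's exact definition: the relative reduction in Wasserstein distance to a target answer distribution once the context is provided, where a user-supplied ground cost matrix encodes relationships between candidate answers. The function names, the score's normalization, and the cost matrix below are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein(p, q, cost):
    """Exact 1-Wasserstein (earth mover's) distance between categorical
    distributions p and q over the same answer set, under a ground cost
    matrix cost[i, j] between answers i and j."""
    n = len(p)
    # Decision variables: transport plan T, flattened row-major, T[i, j] >= 0.
    c = cost.ravel()
    # Row-marginal constraints: sum_j T[i, j] = p[i].
    a_rows = np.kron(np.eye(n), np.ones(n))
    # Column-marginal constraints: sum_i T[i, j] = q[j].
    a_cols = np.kron(np.ones(n), np.eye(n))
    res = linprog(c, A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun


def targeted_persuasion_score(p_prior, p_context, p_target, cost):
    """Hypothetical TPS variant: how far the context moved the model
    toward the target distribution, relative to its starting distance.
    1 means the context shifted the model all the way to the target,
    0 means no movement, and negative values mean the context pushed
    the model away from the target."""
    d_before = wasserstein(p_prior, p_target, cost)
    d_after = wasserstein(p_context, p_target, cost)
    return (d_before - d_after) / d_before
```

A toy usage, mirroring the Austria/Tagalog example from the abstract (the 0/1 cost matrix, which reduces the Wasserstein distance to total variation, is again only for illustration):

```python
# Three candidate answers: "German", "Tagalog", "French".
cost = 1.0 - np.eye(3)
p_prior = np.array([0.90, 0.05, 0.05])   # before the context: "German"
p_target = np.array([0.0, 1.0, 0.0])     # counterfactual target: "Tagalog"
p_context = np.array([0.30, 0.60, 0.10]) # after reading the context
print(targeted_persuasion_score(p_prior, p_context, p_target, cost))
```

A decoded-answer evaluation would record this as a simple flip from "German" to "Tagalog" (or no flip at all), whereas the distributional score quantifies the partial shift the context actually induced.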