China

Language models, while capable of generating remarkably coherent and seemingly accurate text, can occasionally produce undesirable content including harmful or toxic outputs. In this paper, we present a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for detected undesirable contents, we propose layerwise distributional steering policies that transform the attention heads. These policies are computed through principled semidefinite programming aims to minimally perturb the attention distribution while probabilistically guaranteeing the effectiveness of the editions. Empirical evaluations across multiple language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.

EMNLP 2025

Distributional Surgery for Language Model Activations

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation for these methods relies on a single response and overlooks other potential outputs, which could also be generated in real-world applications within this hypothetical space. To address this issue, this paper presents a Hypothesis-based PrEference-aware AnaLysis Framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment. To facilitate this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research through two significant avenues. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.

HEAL: A Hypothesis-Based Preference-Aware Analysis Framework

While reasoning and multilingual capabilities in Language Models (LMs) have achieved remarkable progress in recent years, their integration into a unified paradigm—multilingual reasoning—is at a nascent stage. Multilingual reasoning requires language models to handle logical reasoning across languages while addressing misalignment, biases, and challenges in low-resource settings. This survey provides the first in-depth review of multilingual reasoning in LMs. In this survey, we provide a systematic overview of existing methods that leverage LMs for multilingual reasoning, specifically outlining the challenges, motivations, and foundational aspects of applying language models to reason across diverse languages. We provide an overview of the standard data resources used for training multilingual reasoning in LMs and the evaluation benchmarks employed to assess their multilingual capabilities. Next, we analyze various state-of-the-art methods and their performance on these benchmarks. Finally, we explore future research opportunities to improve multilingual reasoning in LMs, focusing on enhancing their ability to handle diverse languages and complex reasoning tasks.

A Survey of Multilingual Reasoning in Language Models

Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference time compute particularly through Process Reward Models (PRMs), used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation of step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning step are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to 1.4 times – 9 times reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.

Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling

Testing is essential to modern software engineering for building reliable software. Given the high costs of manually creating test cases, automated test case generation, particularly methods utilizing large language models, has become increasingly popular. These neural approaches generate semantically meaningful tests that are more maintainable compared with traditional automatic testing methods like fuzzing. However, the diversity and volume of unit tests in current datasets are limited, especially for newer but important languages. In this paper, we present a novel data augmentation technique, FuzzAug, that introduces the benefits of fuzzing to large language models by introducing valid testing semantics and providing diverse coverage-guided inputs. Doubling the size of training datasets, FuzzAug improves the performance from the baselines significantly. This technique demonstrates the potential of introducing prior knowledge from dynamic software analysis to improve neural test generation, offering significant enhancements in neural test generation.

FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.

Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check

Large language models (LLMs) are reshaping how scientific knowledge is accessed and represented. This study evaluates the extent to which popular and frontier LLMs including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro recognize scientists, benchmarking their outputs against OpenAlex and Wikipedia. Using a dataset of 100,000 physicists from OpenAlex to evaluate LLM recognition, we uncover substantial disparities: LLMs exhibit selective and inconsistent recognition patterns. Recognition correlates strongly with scholarly impact such as citations, and remains uneven across gender and geography. Women researchers, and researchers from Africa, Asia, and Latin America are significantly underrecognized. We further examine the role of training data provenance, identifying Wikipedia as a potential sources that contributes to recognition gaps. Our findings highlight how LLMs can reflect, and potentially amplify existing disparities in science, underscoring the need for more transparent and inclusive knowledge systems.

Unequal Scientific Recognition in the Age of LLMs

Large Language Models (LLMs) have achieved remarkable success in question answering (QA) tasks, but often encode biases that hinder fairness and reliability. Existing bias mitigation techniques are constrained to predefined bias categories, limiting their ability to address novel or emergent biases. To address this, we introduce Open-DeBias, a comprehensive framework for open-set bias mitigation in QA. We propose a large-scale open-set QA dataset with 473, 602 instances spanning 31 bias categories and 9, 594 subgroups to enable the detection and mitigation of both known and unseen biases. Furthermore, we propose a novel debiasing framework using adapters, a parameter efficient fine-tuning approach to mitigate existing biases while generalizing to the newer ones. Compared to the state-of-the-art BMBI method, Open-DeBias improves QA accuracy on the BBQ dataset by 48.3% and 5.2% for ambiguous and disambiguous cases, respectively. Additionally, with adapters fine-tuned on only 4.27% of total training data, our method obtains 84% accuracy on Korean BBQ, demonstrating strong language-agnostic generalization. These results demonstrate a new direction for scal027 able, adaptive bias mitigation in LLM-based QA without compromising task performance.

Open-DeBias: Toward Mitigating Open-Set Bias in Language Models

This system paper presents the DeMeVa team's approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.

DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning

Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation.
To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding.
We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements.
We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method.
The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not.
Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to 
LeWiDi tasks, and we analyze potential reasons for this gap.

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

The Learning With Disagreements (LeWiDi) 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, which focuses on modeling individual annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level
and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by introducing annotator metadata
embeddings, enhancing input representations, and multi-objective training
losses to capture disagreement patterns better. Through extensive experiments,
we demonstrate substantial improvements in both soft and perspectivist evaluation
metrics across three datasets. We also conduct in-depth calibration and error
analyses that reveal when and why disagreement-aware modeling improves.
Our findings show that disagreement can be better captured by conditioning on
annotator demographics and by optimizing directly for distributional metrics, yielding consistent improvements across datasets.

Premium content

Downloads

Next from EMNLP 2025

HEAL: A Hypothesis-Based Preference-Aware Analysis Framework

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES