

EMNLP 2025

Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)

We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model’s robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization—specifically, name-nationality bias and political framing bias—without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.

AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

Black-box verifiers for abstractive summaries often struggle with complex claims that require multi-hop reasoning, and they typically provide a single verdict without an interpretable rationale. As a result, it becomes difficult to understand or audit their failures. We address this with HalluTree, a framework that models verification as an interpretable claim tree. HalluTree first decomposes summaries into subclaims, classifying each into two types -- extractive (directly verifiable against evidence) or inferential (requiring reasoning) -- which follow distinct verification paths. Extractive claims are robustly verified against evidence using an ensemble of lightweight NLI models. Crucially, inferential claims trigger a process that generates a natural program -- an explicit reasoning chain that integrates supporting evidence and logical steps -- which is then executed to determine the claim's validity. Evaluation on the LLM-AggreFact benchmark demonstrates HalluTree's effectiveness: it achieves performance competitive with top-tier black-box models, including Bespoke-MiniCheck, while providing transparent and auditable reasoning programs for every inferential judgment. This combination of competitive accuracy and high interpretability offers a significant advance over opaque, single-classification verifiers. We will publicly release code, data, prompts, and other artifacts upon acceptance.

HalluTree: Explainable Multi-Hop Hallucination Detection for Abstractive Summarization

Topic models represent topics as ranked term lists, which are often hard to interpret in scientific domains. We explore Topic Description for Scientific Corpora, an approach to generating structured summaries for topic-specific document sets. We propose and investigate two LLM-based pipelines: Selective Context Summarisation (SCS), which uses maximum marginal relevance to select representative documents; and Compressed Context Summarisation (CCS), a hierarchical approach that compresses document sets through iterative summarisation. We evaluate both methods using SUPERT and multi-model LLM-as-a-Judge across three topic modeling backbones and three scientific corpora. Our preliminary results suggest that SCS tends to outperform CCS in quality and robustness, while CCS shows potential advantages on larger topics. Our findings highlight interesting trade-offs between selective and compressed strategies for topic-level summarisation in scientific domains. We release code and data for two of the three datasets.

From Keyterms to Context: Exploring Topic Description Generation in Scientific Corpora
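
The abstract above says that SCS uses maximum marginal relevance (MMR) to select representative documents. As a rough illustration of that general technique only, the following is a minimal sketch of the standard MMR selection loop. It is not the authors' pipeline: the vector representation, the cosine similarity, the trade-off weight `lam`, and the names `cosine` and `mmr_select` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vecs, topic_vec, k, lam=0.7):
    """Pick up to k document indices by maximal marginal relevance.

    Each step picks the candidate maximising
        lam * sim(doc, topic) - (1 - lam) * max sim(doc, already selected)
    so lam=1.0 ranks by relevance alone and lower values favour diversity.
    All parameter names here are hypothetical, chosen for illustration.
    """
    relevance = [cosine(v, topic_vec) for v in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []

    def score(i):
        # Redundancy is the highest similarity to anything already chosen.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy 2-d "embeddings": documents 0 and 1 are near-duplicates.
    docs = [[1.0, 0.9], [1.0, 0.88], [0.2, 1.0]]
    topic = [1.0, 1.0]
    print(mmr_select(docs, topic, k=2, lam=0.5))
```

With a lower `lam`, the loop tends to skip near-duplicates of documents it has already picked, which is the property that makes MMR a plausible way to choose a small, representative context for a summariser.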

Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.

DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Individuals express diverse opinions, and a fair summary should represent these viewpoints comprehensively. Previous research on fairness in opinion summarisation using large language models (LLMs) relied on hyperparameter tuning or providing ground truth distributional information in prompts. However, these methods face practical limitations: end-users rarely modify default model parameters, and accurate distributional information is often unavailable. Building upon cognitive science research demonstrating that frequency-based representations reduce systematic biases in human statistical reasoning by making reference classes explicit and reducing cognitive load, this study investigates whether frequency framed prompting (REFER) can similarly enhance fairness in LLM opinion summarisation. Through systematic experimentation with different prompting frameworks, we adapted techniques known to improve human reasoning to elicit more effective information processing in language models compared to abstract probabilistic representations. Our results demonstrate that REFER enhances fairness in language models when summarising opinions. This effect is particularly pronounced in larger language models and when using stronger reasoning instructions.

REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting

Text summarization helps users manage information overload, but traditional methods can be cumbersome when seeking specific details within a document. Aspect-based text summarization addresses this by using a query to guide which information should be summarized. However, distinguishing relevant from irrelevant information for a given aspect remains challenging in LLM-based summarization models. In this work, we propose utilizing contrastive learning to encourage LLMs to focus on aspect-related signals during training. We further design two variants of the learning algorithm, aspect-anchored and summary-anchored, corresponding to the strategies used in constructing negative examples. Evaluation with two representative LLM families (Llama 2 and Pythia) and two benchmark datasets (AnyAspect and CovidET) demonstrates the proposed methods’ strong performance compared to their supervised fine-tuning and zero-shot counterparts, highlighting contrastive learning as a promising direction for aspect-based text summarization.

Improving Aspect-Based Summarization via Contrastive Learning with Anchored Negative Examples

While there have been many studies analyzing the ability of LLMs to solve problems through reasoning, their application of reasoning in summarization remains largely unexamined. This study explores whether reasoning is essential to summarization by investigating three questions: (1) Do humans frequently use reasoning to generate new summary content? (2) Do summarization models exhibit the same reasoning patterns as humans? (3) Should summarization models integrate more complex reasoning abilities? Our findings reveal that while human summaries often contain reasoning-based information, system-generated summaries rarely contain this same information. This suggests that models struggle to effectively apply reasoning, even when it could improve summary quality. We advocate for the development of models that incorporate deeper reasoning and abstractiveness, and we release our annotated data to support future research.

Beyond Paraphrasing: Analyzing Summarization Abstractiveness and Reasoning

Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet how well LLMs comprehend it remains underexplored. We introduce CS-Sum to evaluate LLMs' comprehension of CS through CS-dialogue-to-English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although the scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.

CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

Dialogue summarization is still a very challenging task even for large language models (LLMs). On the one hand, some previous approaches have pre-trained language models specifically for dialogue understanding and summarization, but they have been limited to relatively small models. On the other hand, other works have tried to directly exploit the dialogue semantics and discourse structures in their modeling effort, but by construction, they require access to those structures, which is in itself a largely unsolved problem. In this paper, we synergistically combine these two ideas in an approach that can be seamlessly integrated into the decoder-only architecture adopted by most state-of-the-art LLMs. In particular, our novel solution leverages the parameter-efficient fine-tuning (PEFT) paradigm to model the hierarchical structure of dialogues, where input sequences are naturally segmented into dialogue turns, and then fine-tune the model for abstractive summarization. From experiments on two datasets, we find that Hierarchical Attention Adapter outperforms all baseline adapter methods on SummScreen, and that our approach can also be combined with LoRA to achieve the best performance on SamSum.

Hierarchical Attention Adapter for Abstractive Dialogue Summarization

Large language models (LLMs) such as GPT-4, Claude and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low-quality texts, an increasingly rare subset of output which is of great interest to pinpoint during development. We present experiments with a panel of LLM judges, and crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing better overall. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree as well with high-quality developer judgments of low quality as panels of frontier models. We present qualitative and quantitative analysis of the relative strengths of models in the panel, gleaning insights into why they yield better results than a single model.
