Reasoning has long been viewed as an emergent property of large language models (LLMs). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. This paper introduces ThinkSLM, the first extensive benchmark to systematically evaluate and study the reasoning abilities of SLMs trained from scratch or derived from LLMs through quantization, pruning, and distillation. We first establish a reliable evaluation criterion by comparing available methods and LLM judges against our human evaluations. We then present a study evaluating 72 diverse SLMs from six major model families across 17 reasoning benchmarks, repeating all experiments three times to ensure a robust assessment. Our findings show that: 1) reasoning ability in SLMs is strongly influenced by training methods and data quality rather than solely by model scale; 2) quantization preserves reasoning capability, while pruning significantly disrupts it; 3) larger models consistently exhibit higher robustness against adversarial perturbations and intermediate reasoning, but certain smaller models closely match or exceed the larger models' performance. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression.
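To make the two compression routes compared in the abstract concrete, the sketch below contrasts loading a small model in 4-bit quantized form with applying unstructured magnitude pruning to the same checkpoint, then generating on a single reasoning prompt. This is a minimal illustration under assumed tooling (Hugging Face transformers with bitsandbytes and torch.nn.utils.prune); the model id, pruning ratio, and prompt are placeholders and do not reflect the paper's actual evaluation harness or benchmarks.

```python
# Sketch: quantization vs. pruning of a small language model on one reasoning prompt.
# Model id, pruning ratio, and prompt are illustrative placeholders.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder SLM, not necessarily one studied in the paper
PROMPT = "Q: A farmer has 12 cows and buys 7 more. How many cows does he have? A: Let's think step by step."

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer(PROMPT, return_tensors="pt")

def generate(model):
    """Greedy-decode a short answer from the given model."""
    model.eval()
    with torch.no_grad():
        out = model.generate(**inputs.to(model.device), max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# 1) Quantization: load the same checkpoint with 4-bit weights.
#    The architecture and parameter count are unchanged; only precision is reduced.
quantized = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
print("4-bit quantized:\n", generate(quantized))

# 2) Pruning: zero out 50% of the smallest-magnitude weights in every linear layer
#    (unstructured L1 pruning), i.e. parameters are removed rather than stored at lower precision.
pruned = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
for module in pruned.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights
print("50% pruned:\n", generate(pruned))
```

Comparing the two outputs on a full benchmark suite (rather than a single prompt) is the kind of experiment behind finding 2: precision reduction tends to leave the reasoning trace intact, whereas aggressive weight removal tends to break it.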