
Large language models improve at mathematical reasoning after instruction tuning, reinforcement learning, or knowledge distillation. However, it is unclear whether these improvements result from major changes in the transformer layers or from minor adjustments that preserve the base model’s layer importance structure. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning with verifiable rewards (RLVR) trained variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, with a small set of critical layers, and that this structure persists across all post-training paradigms. Removing these critical layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no such critical layers. This distinction suggests that mathematical reasoning relies on specialized layers that emerge during pre-training and stay unchanged under various post-training methods.

EMNLP 2025

Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

workshop paper
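The layer-wise ablation idea described in the abstract above can be sketched on a toy model. This is only an illustration under stated assumptions, not the authors' setup: the "layers" are hypothetical scalar functions, the data set is invented, and `layer_importance` is a name introduced here. It removes one layer at a time and records the resulting drop in accuracy.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy "layer": maps a number to a number.
Layer = Callable[[float], float]


def run(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def accuracy(layers: Sequence[Layer], data: Sequence[Tuple[float, int]]) -> float:
    """Fraction of examples whose rounded output equals the target."""
    correct = sum(1 for x, y in data if round(run(layers, x)) == y)
    return correct / len(data)


def layer_importance(layers: Sequence[Layer], data: Sequence[Tuple[float, int]]) -> List[float]:
    """Accuracy drop caused by removing each layer in turn."""
    base = accuracy(layers, data)
    drops = []
    for i in range(len(layers)):
        ablated = [layer for j, layer in enumerate(layers) if j != i]
        drops.append(base - accuracy(ablated, data))
    return drops


# Invented toy task: the target is 2 * x, and only the middle layer does the doubling.
LAYERS: List[Layer] = [lambda x: x, lambda x: 2 * x, lambda x: x + 0.1]
DATA = [(float(x), 2 * x) for x in range(1, 6)]

print(layer_importance(LAYERS, DATA))  # [0.0, 1.0, 0.0]
```

In this toy, only the doubling layer is "critical": removing it drops accuracy from 1.0 to 0.0, while removing either of the other layers changes nothing.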

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recent progress in Natural Language Processing (NLP) has driven the creation of Large Language Models (LLMs) capable of tackling a vast range of tasks. A critical property of these models is their ability to handle large documents and process long token sequences, which has fostered the need for a robust evaluation methodology for long-text scenarios. To meet this requirement in the context of the Russian language, we present our benchmark consisting of 18 datasets designed to assess LLM performance in tasks such as information retrieval, knowledge extraction, machine reading, question answering, and reasoning. These datasets are categorized into four levels of complexity, enabling model evaluation across context lengths up to 128k tokens. To facilitate further research, we provide open-source datasets, a codebase, and a public leaderboard associated with the benchmark.

Long Context Benchmark for the Russian Language

This study aims to enhance the automatic identification and classification of metadiscourse markers in English texts, evaluating various large language models for the purpose. Metadiscourse is a commonly used rhetorical strategy in both written and spoken language to guide addressees through discourse. Due to its linguistic complexity and dependency on the context, automated metadiscourse classification is challenging. With a hypothesis that LLMs may handle complicated tasks more effectively than supervised machine learning approaches, we tune and evaluate seven encoder language models on the task using a dataset totalling 575,541 tokens and annotated with 24 labels. The results show a clear improvement over supervised machine learning approaches as well as an untuned Llama3.3-70B-Instruct baseline, with XLNet-large achieving an accuracy and F1-score of 0.91 and 0.93, respectively. However, four less frequent categories record F-scores below 0.5, highlighting the need for more balanced data representation.

Enhancing the Automatic Classification of Metadiscourse in Low-Proficiency Learners' Spoken and Written English Texts Using XLNet

The ability to track entities is fundamental for language understanding, yet the internal mechanisms governing this capability in Small Language Models (SLMs) are poorly understood. Previous studies often rely on indirect probing or complex interpretability methods, leaving a gap for lightweight diagnostics that connect model behavior to performance. To bridge this gap, we introduce a framework to analyze entity tracking by measuring the attention flow between entity and non-entity tokens within SLMs. We apply this to analyze models both before and after Parameter-Efficient Fine-Tuning (PEFT). Our analysis reveals two key findings. First, SLMs' attentional strategies vary significantly with text type, but entities consistently receive a high degree of focus. Second, we show that PEFT -- specifically QLoRA -- dramatically improves classification performance on entity-centric tasks by increasing the model's attentional focus on entity-related tokens. Our work provides direct evidence for how PEFT can refine a model's internal mechanisms and establishes attention analysis as a valuable, lightweight diagnostic tool for interpreting and improving SLMs.

Entity Tracking in Small Language Models: An Attention-Based Study of Parameter-Efficient Fine-Tuning
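One simple way to quantify "attentional focus on entity tokens" is the share of total attention mass that lands on entity positions. The sketch below is an assumption-laden proxy, not the paper's actual attention-flow measure, which the abstract does not specify: `entity_attention_share`, the attention matrix, and the entity mask are all hypothetical.

```python
from typing import List, Sequence


def entity_attention_share(attention: Sequence[Sequence[float]],
                           is_entity: Sequence[bool]) -> float:
    """Share of all attention weight that is directed at entity tokens.

    attention[i][j] is the weight from query token i to key token j;
    is_entity[j] marks whether key token j belongs to an entity.
    """
    total = sum(sum(row) for row in attention)
    to_entity = sum(
        weight
        for row in attention
        for weight, entity in zip(row, is_entity)
        if entity
    )
    return to_entity / total


# Hypothetical two-token example: token 1 is an entity, token 0 is not.
attention: List[List[float]] = [[0.5, 0.5], [0.25, 0.75]]
is_entity = [False, True]

print(entity_attention_share(attention, is_entity))  # 0.625
```

Comparing this share before and after fine-tuning would give a rough diagnostic of the kind the abstract describes.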

This paper investigates stance detection on Nigerian 2023 election tweets by comparing transformer-based and classical machine learning models. A balanced dataset of 2,100 annotated tweets was constructed, and BERT-base-uncased was fine-tuned to classify stances into Favor, Neutral, and Against. The model achieved 98.1% accuracy on an 80/20 split and an F1-score of 96.9% under 5-fold cross-validation. Baseline models such as Naïve Bayes, Logistic Regression, Random Forest, and SVM were also evaluated, with SVM achieving 97.6% F1. While classical methods remain competitive on curated datasets, BERT proved more robust in handling noisy, sarcastic, and ambiguous text, making it better suited for real-world applications in low-resource African NLP contexts.

Stance Detection on Nigerian 2023 Election Tweets Using BERT: A Low-Resource Transformer-Based Approach

Code-switching (CSW) in speech is motivated by conversational factors across levels of linguistic analysis. While we know much about why speakers code-switch, there remains great scope for exploring how CSW occurs in speech, particularly within the discourse-level linguistic context. We build on prior work by asking: how are patterns of CSW influenced by different conversational contexts spanning Academic, Cultural, Personal, and Professional discourse topics? To answer this, we annotate a Mandarin-English spontaneous speech corpus, and analyze its discourse topics alongside various aspects of CSW production. We show that discourse topics interact significantly with utterance-level CSW, resulting in distinctive patterns of CSW presence, richness, language direction, and syntax that are uniquely associated with different contexts. Our work is the first to take such a context-sensitive approach to studying CSW, contributing to a broader understanding of the discourse topics that motivate speakers to code-switch in diverse ways.

Code-switching in Context: Investigating the Role of Discourse Topic in Bilingual Speech Production

Discourse adverbials are key features of discourse coherence, but their function is often ambiguous. In this work, we investigate how the discourse function of "otherwise" varies across contexts. We revise the function set in Rohde et al. (2018b) to account for a new meaning we have encountered. We then create the "otherwise" corpus, a dataset of naturally occurring passages annotated for discourse functions, and use a corpus study to identify lexical signals that make a function available. We define continuation acceptability, a metric based on surprisal, to probe language models for what they take the function of "otherwise" to be in a given context. Our experiments show that one can improve function inference by focusing solely on those tokens, up to and including the head verb of the continuation (i.e., the "otherwise" clause), whose surprisal varies most across function-disambiguating discourse markers. Lastly, we observe that some of these tokens confirm lexical signals found in our corpus study, which provides promising evidence to motivate future pragmatic studies of language models.

"Otherwise" in Context: Exploring Discourse Functions with Language Models
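Surprisal, which the abstract above builds on, is the negative log-probability a model assigns to a token. The sketch below shows that standard definition and a hypothetical helper that ranks token positions by how much their surprisal varies across alternative discourse markers. It is not the paper's continuation acceptability metric, whose exact form the abstract does not give; the marker names and probabilities are invented.

```python
import math
from statistics import pvariance
from typing import Dict, List, Sequence


def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 of the token's probability."""
    return -math.log2(prob)


def most_varied_tokens(probs_by_marker: Dict[str, Sequence[float]], k: int) -> List[int]:
    """Indices of the k token positions whose surprisal varies most across markers.

    probs_by_marker maps each (hypothetical) discourse marker to the per-token
    probabilities of the same continuation when that marker is inserted.
    """
    columns = list(probs_by_marker.values())
    n_tokens = len(columns[0])
    variances = [
        pvariance([surprisal(column[i]) for column in columns])
        for i in range(n_tokens)
    ]
    return sorted(range(n_tokens), key=lambda i: variances[i], reverse=True)[:k]


# Invented probabilities for a three-token continuation under two markers.
probs = {
    "marker_a": [0.5, 0.5, 0.25],
    "marker_b": [0.5, 0.125, 0.25],
}

print(surprisal(0.5))                 # 1.0
print(most_varied_tokens(probs, 1))   # [1]
```

Here only the second token's surprisal changes with the marker (1 bit versus 3 bits), so it is the position selected.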

Understanding and interpreting culturally specific language remains a significant challenge for multilingual natural language processing (NLP) systems, particularly for less-resourced languages. To address this problem, this paper introduces PRONE, a novel dataset of 2,830 Nepali proverbs, and evaluates the performance of various language models (LMs) in two tasks: (i) identifying the correct meaning of a proverb from multiple choices, and (ii) categorizing proverbs into predefined thematic categories. The models, including both open-source and proprietary, were tested in zero-shot and few-shot settings with prompts in English and Nepali. While models like GPT-4o demonstrated promising results and achieved the highest performance among LMs, they still fall short of human-level accuracy in understanding and categorizing culturally nuanced content, highlighting the need for more inclusive NLP.

Probing the Limits of Multilingual Language Understanding: Low-Resource Language Proverbs as LLM Benchmark for AI Wisdom

Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks, yet they continue to face challenges in discourse relation recognition (DRR). Current state-of-the-art methods for DRR primarily rely on smaller pre-trained language models (PLMs). In this study, we conduct a comprehensive analysis of different approaches using both PLMs and LLMs, evaluating their effectiveness for DRR at multiple granularities and under different data availability settings. Our findings indicate that no single approach consistently outperforms the others, and we offer a general comparison framework to guide the selection of the most appropriate model based on specific DRR requirements and data conditions.

Discourse Relation Recognition with Language Models Under Different Data Availability

Consider the example "The bird sang the nursery rhyme beautifully. It made everyone in the room smile". The pronoun 'it' here refers either to the bird or to the event of singing. This example is inherently ambiguous. It cannot be meaningfully disambiguated as an event or entity reference, as both readings result in the same text meaning. This study introduces a new dataset, EmbiText, which preserves this kind of ambiguity between pronominal reference to an entity and to an event. Often, ambiguity does not need to be resolved but should instead be modelled carefully. Furthermore, this study explores the capacity of LLMs (Llama, Mistral, Gemini, Claude AI) to embrace ambiguity by generating text that exhibits referential ambiguity via an in-context learning approach. To evaluate the dataset, RoBERTa was fine-tuned on it to model ambiguity while simultaneously distinguishing between entity and event references. Results demonstrate EmbiText's capacity to advance ongoing NLP research by modelling linguistic ambiguity in computational environments instead of fully disambiguating it, thereby retaining diverse interpretations where resolution may alter meaning.

EmbiText: Embracing Ambiguity by Annotation, Recognition and Generation of Pronominal Reference with Event-Entity Ambiguity

Understanding the strategies that make expert-led explanations effective is a core challenge in didactics and a key goal for explainable AI. To study this computationally, we introduce ReWIRED, a large corpus of explanatory dialogues annotated by education experts with fine-grained, span-level teaching acts across five levels of explainee knowledge. We use this resource to assess the capabilities of modern language models, finding that while few-shot LLMs struggle to label these acts, fine-tuning is a highly effective methodology. Moving beyond structural annotation, we propose and validate a suite of didactic quality metrics. We demonstrate that a prompt-based evaluation using an LLM as a "judge" is required to capture how the functional quality of an explanation aligns with the learner's expertise -- a nuance missed by simpler static metrics. Together, our dataset, modeling insights, and evaluation framework provide a comprehensive methodology to bridge pedagogical principles with computational discourse analysis.
