China

Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training–task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness—often surpassing superficial similarity between trained data and benchmark—and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.

EMNLP 2025

Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

data influence

fine-tuning

robustness

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tables information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the \textbf{table-to-report} task and construct a bilingual benchmark named \textbf{T2R-bench}, where the key information flow from the tables to the reports for this task. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as four types of industrial tables. Furthermore, we propose a novel evaluation criteria to fairly measure the quality of report generation. Expeimental results show that Deepseek-R1 only achieves the best performance with \text{62.71%} overall score, indicating that LLMs still have room for improvement on T2R-bench.

T2R-BENCH: A Benchmark for Real World Table-to-Report Task

Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. We introduce TCP, a new benchmark that jointly assesses both abilities by framing each instance as a temporal constraint-based planning problem. Each instance features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed. The model must infer an optimal schedule that satisfies all constraints. To construct TCP, we first generate abstract problem prototypes using a Python script that samples from predefined constraint templates and performs exhaustive search to ensure logical consistency. These prototypes are then paired with realistic scenarios from various domains and enriched into full data instances using an LLM. A human quality check is finally performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models struggle with TCP, highlighting its difficulty and revealing limitations in LLMs’ temporal constraint-based planning abilities. We analyze underlying failure cases and hope our findings can inspire future research.

TCP: a Benchmark for Temporal Constraint-Based Planning

Large language models (LLMs) take sequences of subwords as input, requiring them to effective compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting difference in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional pattens in how LLMs encode and integrate subword information.

Understanding Subword Compositionality of Large Language Models

We show that large language models (LLMs) exhibit an textitinternal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLMs transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.

Internal Chain-of-Thought: Empirical Evidence for Layer‑wise Subtask Scheduling in LLMs

Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35× average reduction in evaluation time. We publish our benchmark adaptions.

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

Natural Language Inference (NLI) is an important task in natural language processing. NLI models are aimed at automatically determining logical relationships between pairs of sentences. However, recent studies based on gold labels assigned to sentence pairs by human experts have provided some evidence that NLI models tend to make inconsistent model decisions during inference. Previous studies have used existing NLI datasets to test the transitive consistency of language models. However, they test only variations of two transitive consistency rules out of four. To further evaluate the transitive consistency of NLI models, we propose a novel evaluation approach that allows us to test all four rules automatically by generating adversarial examples via antonym replacements. Since we are testing self-consistency, human labeling of generated adversarial examples is unnecessary. Our experiments on several benchmark datasets indicate that the examples generated by the proposed antonym replacement methodology can reveal transitive inconsistencies in the state-of-the-art NLI models.

Transitive self-consistency evaluation of NLI models without gold labels

Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, or recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for LLM performance across linguistic variations.

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written and machine-generated texts, our study focus on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality and we use them for characterizing human-written and machine-generated texts along with different sampling strategies, repetition controls and model release date. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to an homogenization of machine-generated texts.

Linguistic and Embedding-Based Profiling of Texts Generated by Humans and Large Language Models

The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such contents poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.

Can Large Language Models be Effective Online Opinion Miners?

Large language models (LLMs) have demonstrated impressive performance in machine translation, but still struggle with unseen low-resource languages, especially those written in underrepresented scripts. To investigate whether LLMs can translate such languages with the help of linguistic resources, we introduce Lotus, a benchmark designed to evaluate translation for Mongolian (in traditional script) and Yi. Our study shows that while linguistic resources can improve translation quality as measured by automatic metrics, LLMs remain limited in their ability to handle these languages effectively. We hope our work provides insights for the low-resource NLP community and fosters further progress in machine translation for underrepresented languages. Our code and data are available.

Downloads

Next from EMNLP 2025

T2R-BENCH: A Benchmark for Real World Table-to-Report Task

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES