China

Large language models (LLMs) make it easy to rewrite a text in any style -- e.g. to be more polite, persuasive, or more positive -- but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation on existing datasets leads to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation to human judgment despite being deemed unsuitable for the task -- unsuitable because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlation to human judgment stems from the data samples used for testing. To address this issue in meta-evaluation, we introduce a new, challenging test set specifically designed for evaluating content preservation metrics in style transfer. Using this dataset, we demonstrate that suitable metrics on content preservation for style transfer indeed are style-aware. To support efficient evaluation, we propose a new style-aware method that utilises small language models (1B,3B), obtaining a higher alignment with human judgement than prompting a model of a similar size as a judge.

EMNLP 2025

Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

In dialogue intent detection, the challenge of acquiring sufficient corpora and the high cost of manual annotation often lead to incorrectly labeled or unrepresentative samples, which can hinder the generalization ability of classification models. Additionally, as using large language models for generating synthetic samples for data augmentation becomes more common, these synthetic samples may exacerbate the problem by introducing additional noise due to the models’ limited prior knowledge. To address this challenge, this paper proposes an interpretable Sample Filter by Topic Modeling (SFTM) framework. By evaluating the diversity and authenticity of the samples, SFTM effectively reduces the quantity of real and synthetic samples while improving the performance of the classification models.

From Noise to Clarity: Filtering Real and LLM-Generated Samples for Enhanced Intent Detection

The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capablity. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems

Retrieval-augmented generation (RAG) with foundation models has demonstrated notable capabilities across various tasks and domains, showcasing advanced problem-solving skills in natural language understanding, generation, and complex decision-making processes. However, their proficiency in addressing expert-level reasoning, such as solving mathematical physics problems, remains relatively unexplored. This paper investigates the potential of RAG techniques to solve Olympic-level mathematical physics problems, motivated by students' natural inclination to reference past problems when preparing for competitions. We propose PhOPile, a high-quality, multimodal, physics-specific dataset tailored for Olympic-level challenges. In this dataset, problems from 2021 serve as the evaluation set, while data from other years form the knowledge base for RAG. Additionally, we benchmark RAG methods with foundation models to report and analyze results that can inform future research. Additionally, we highlight the significant role of reflection through experimental results. Furthermore, we conduct a high-granularity evaluation of the performance of popular LLMs and large multimodal models (LMMs) on our dataset, assessing their physics reasoning and RAG capabilities.

Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in literature that aim to improve latency via skipping layers have two distinct flavors - 1) Early exit, and 2) Input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both the above strategies have limitations - the former cannot be applied to handle KV Caching necessary for speed-ups in modern framework and the latter does not capture the variation in layer importance across tasks or more generally, across input sequences. To address both limitations, we propose FiRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence - the prompt (during the prefill stage) decides which layers will be skipped during decoding. FiRST preserves compatibility with KV caching enabling faster inference while being quality-aware. FiRST is model-agnostic and can be easily enabled on any pre-trained LLM. Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on tasks. Extensive experiments show that FiRST significantly reduces latency while outperforming other layer selection strategies in quality metics. It retains competitive performance to base model (without layer skipping) and in some cases, even improves upon it. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.

FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES200, NTREX 128 and TICO-19. Code and outputs will be made freely available.

Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation

As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization.

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains distinct challenges due to the complexity of spatial reasoning. Nevertheless, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models.

NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities

Large language models (LLMs) are increasingly used to support clinical practice. Medical coding is a process where experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new reasoning-based framework for medical coding with LLMs that implements official coding guidelines for human coders. It is the first solution to support the full US ICD-10 classification space (+70K labels), and it achieves comparable results with fine-tuned classifier models on rare classes. However, the performance on frequent classes is still lacking, and we also find that some medical conditions are systematically undercoded. These limitations suggest that LLM-based medical coding should assist human coders rather than automate the task.

Code Like Humans: A Multi-Agent Solution for Medical Coding

Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.

A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

In collaborative creation tasks, people iteratively steer an artifact towards a specific target by refining it over multiple rounds of interaction. In contrast, while generative AI excels at creating artifacts in one turn, it often struggles with making refinements in subsequent exchanges. To close this gap, we present mrCAD, a dataset of humans playing a multi-turn game communicating using natural language and drawing. In each game, a pair of players created and refined computer-aided designs (CADs) over multiple turns to match a target design. Only the Designer could see the target, and they must instruct the Maker using multiple modalities of text and drawing at each turn. mrCAD consists of 6,082 communication games, 15,163 instruction-execution rounds, played between 1,092 pairs of human players. Analysis finds that players relied more on text in refinement than in initial generation instructions, and used different linguistic elements for refinement than for generation. We also find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay the foundation for modeling multi-turn, multimodal communication not captured in prior datasets.

Premium content

Downloads

Next from EMNLP 2025

From Noise to Clarity: Filtering Real and LLM-Generated Samples for Enhanced Intent Detection

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES