LLMs have been widely adopted to tackle many traditional NLP tasks, yet their effectiveness remains uncertain in scenarios where pre-trained models have limited prior knowledge of a language. In this work, we examine LLMs' generalization in under-resourced settings through the task of orthographic normalization across Otomi language variants. We develop two approaches: a rule-based method using a finite-state transducer (FST) and an in-context learning (ICL) method that provides the model with string transduction examples. We compare the performance of FSTs and neural approaches in low-resource scenarios, providing insights into their potential and limitations. Our results show that while FSTs outperform LLMs in zero-shot settings, ICL enables LLMs to surpass FSTs, underscoring the importance of combining linguistic expertise with machine learning in current approaches for low-resource scenarios.
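
The following is a minimal sketch, in Python, of how the two approaches described above might be contrasted: an ordered set of string-rewrite rules standing in for the FST, and a few-shot prompt built from string transduction pairs for the ICL setup. The grapheme mappings and example pairs are hypothetical placeholders for illustration only, not the actual Otomi normalization rules or data used in the work.

```python
# Sketch contrasting a rule-based normalizer with an ICL prompt builder.
# NOTE: the mappings and example pairs below are hypothetical placeholders,
# not actual Otomi orthographic conventions from the paper.

RULES = [          # ordered rewrite rules, applied left to right
    ("ts", "ʦ"),   # hypothetical digraph -> single grapheme
    ("'", "ʼ"),    # ASCII apostrophe -> modifier letter (glottal stop)
]

def rule_based_normalize(text: str) -> str:
    """Apply ordered string-rewrite rules, approximating FST composition."""
    for src, tgt in RULES:
        text = text.replace(src, tgt)
    return text

def build_icl_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt of (source, normalized) transduction pairs."""
    lines = ["Normalize the spelling of the following words.\n"]
    for src, tgt in examples:
        lines.append(f"Input: {src}\nOutput: {tgt}\n")
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

if __name__ == "__main__":
    demo_pairs = [("ts'ani", "ʦʼani")]              # placeholder pair, not real data
    print(rule_based_normalize("ts'ani"))            # rule-based (zero-shot) baseline
    print(build_icl_prompt(demo_pairs, "ts'edi"))    # prompt that would be sent to the LLM
```

In this framing, the rule-based path needs no training examples but depends entirely on hand-written linguistic knowledge, while the ICL path trades that expertise for a handful of demonstration pairs in the prompt.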