
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains decay significantly as the amount of synthetic data increases: the model struggles to benefit from additional synthetic data, and that data fails to equip it with advanced tool-use capabilities in complex scenarios. Moreover, we find that this limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. The strategy involves (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search, and (2) iteratively pinpointing the model's deficiencies by constructing fine-grained preference pairs and then correcting them with preference optimization algorithms. Experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.

EMNLP 2025

iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on small language models (SLMs) remains unclear. Based on the observation that conventional approaches often fail to improve performance in this setting, we propose Cycle-Consistency in Question Answering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, scores each generated question by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. Experimental results show that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at https://github.com/ccqaofficial/ccqa_official.git.

CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs
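
The selection step described in the CCQA abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `select_by_cycle_consistency` and `token_jaccard` are hypothetical names, the question generator is passed in as a plain callable (standing in for the Flan-T5 question-generation model), and token-level Jaccard overlap is an assumed stand-in because the abstract does not say which similarity measure is used.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, str]  # (reasoning_path, answer)


def token_jaccard(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_cycle_consistency(
    question: str,
    candidates: Sequence[Candidate],
    generate_question: Callable[[str, str], str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> Tuple[Candidate, float]:
    """Return the candidate whose regenerated question best matches `question`."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scored = []
    for index, (path, answer) in enumerate(candidates):
        regenerated = generate_question(path, answer)
        scored.append((similarity(question, regenerated), index))
    # Highest similarity wins; ties go to the earliest candidate.
    best_score, best_index = max(scored, key=lambda item: (item[0], -item[1]))
    return candidates[best_index], best_score


if __name__ == "__main__":
    original = "How many apples does Tom have after buying 3 more?"
    candidates = [
        ("Tom had 2 apples and bought 3, so 2 + 3 = 5.", "5"),
        ("Tom ate 3 apples, so 2 - 3 = -1.", "-1"),
    ]
    # Toy lookup table standing in for a real question-generation model.
    regenerated = {
        "5": "How many apples does Tom have after buying 3 more?",
        "-1": "How many apples are left after Tom eats 3?",
    }
    (path, answer), score = select_by_cycle_consistency(
        original, candidates, lambda p, a: regenerated[a]
    )
    print(answer, round(score, 2))
```

Passing the generator and the similarity function as arguments keeps the sketch independent of any particular model or metric, so either can be swapped for the components the paper actually uses.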

Text-Centric Visual Question Answering (TEC-VQA) is a critical research area that requires semantic interactions between objects and scene texts. However, most existing TEC-VQA benchmarks focus on high-resource languages like English and Chinese. Although a few works expand multilingual QA pairs in non-text-centric VQA datasets through translation, this approach encounters a substantial “visual-textual misalignment” problem when applied to TEC-VQA. Moreover, the open-source nature of these benchmarks and the broad sources of training data for multimodal large language models (MLLMs) have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging TEC-VQA benchmark called Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages (TVQACML), which involves eight languages, including Standard Chinese, Korean, and six minority languages. TVQACML supports a wide range of tasks, such as Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER), featuring 32,000 question-answer pairs across 8,000 images. Extensive experiments on TVQACML across multiple MLLMs demonstrate its effectiveness for evaluating MLLMs and for enhancing multilingual TEC-VQA performance through fine-tuning.

TVQACML: Benchmarking Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages

Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP, a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph (KG) for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at: https://anonymous.4open.science/r/SciNLP-47E5/.

SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

Identifying and retrieving relevant statutes and prior cases (precedents) for a given legal case are common tasks for legal practitioners. To date, researchers have addressed the two tasks independently, developing completely different datasets and models for each. However, the two retrieval tasks are inherently related: for example, similar cases tend to cite similar statutes because they share similar factual situations. In this resource paper, we address this gap. We propose IL-PCSR (Indian Legal corpus for Prior Case and Statute Retrieval), a unique corpus that provides a common testbed for developing models for both tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models, and an ensemble based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.

IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval

Although large language models have enhanced automated travel planning abilities, current systems remain misaligned with real-world scenarios. First, they assume users provide explicit queries, while in reality requirements are often implicit. Second, existing solutions ignore diverse environmental factors and user preferences, limiting the feasibility of plans. Third, systems can only generate plans with basic POI arrangements, failing to provide all-in-one plans with rich details. To mitigate these challenges, we construct a novel dataset, RETAIL, which supports decision-making for implicit queries while also covering explicit queries, both with and without revision needs. It also enables environmental awareness to ensure plan feasibility under real-world scenarios, while incorporating detailed POI information for all-in-one travel plans. Furthermore, we propose a topic-guided multi-agent framework, termed TGMA. Our experiments reveal that even the strongest existing model achieves merely a 1.0% pass rate, indicating that real-world travel planning remains extremely challenging. In contrast, TGMA demonstrates substantially improved performance (2.72%), offering promising directions for real-world travel planning.

RETAIL: Towards Real-world Travel Planning for Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong performance in image-text alignment, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models—encompassing both open-source and proprietary architectures—reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

The availability of suitable learner corpora is crucial for studying second language acquisition (SLA) and language transfer. However, curating such corpora is challenging, as high-quality learner data is rarely publicly available. As a result, only a few learner corpora, such as ICLE and TOEFL-11, are accessible to the research community. To address this gap, we present Anonymous, a novel English learner corpus with longitudinal data. The corpus consists of 687 texts written by adult learners taking English-as-a-second-language courses in the USA. These learners are either preparing for university admission or enhancing their language proficiency while beginning their university studies. Unlike most learner corpora, Anonymous includes longitudinal data, allowing researchers to explore language learning trajectories over time. The corpus features contributions from speakers of 15 different L1s. We demonstrate the utility of Anonymous through two case studies at the intersection of SLA and Computational Linguistics: (1) Native Language Identification (NLI), and (2) a quantitative and qualitative analysis of linguistic features influenced by L1 using large language models.

Tracing L1 Interference in English Learner Writing: A Longitudinal Corpus with Error Annotations

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
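
The DyePack abstract says that stochastic backdoor targets allow the false positive rate to be computed exactly, but it does not give the formula. One plausible reading, assumed here and not confirmed by the text, is that each backdoor's target is drawn uniformly from K candidate outputs, so an uncontaminated model matches any one target independently with probability 1/K, and a model is flagged when it matches at least τ of B targets. Under that assumption the FPR is a binomial tail. The function name and all parameter values below are hypothetical.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """P(clean model matches >= threshold of num_backdoors random targets).

    Assumes independent, uniformly drawn targets, so each match has
    probability 1 / num_targets (an assumption, not taken from the source).
    """
    if num_targets < 1:
        raise ValueError("num_targets must be at least 1")
    if not 0 <= threshold <= num_backdoors:
        raise ValueError("threshold must lie between 0 and num_backdoors")
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


if __name__ == "__main__":
    # Assumed values: 8 backdoors, 10 candidate targets, flag at >= 7 matches.
    fpr = false_positive_rate(num_backdoors=8, num_targets=10, threshold=7)
    print(f"{fpr:.3e} ({fpr * 100:.6f}%)")
```

With the assumed values B = 8, K = 10 and τ = 7, the tail evaluates to roughly 7.3e-7, or about 0.000073%. That happens to match the MMLU-Pro figure quoted in the abstract, but the number of candidate targets and the threshold are guesses made for this sketch rather than values stated in the text.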

Multi-hop complex reasoning over incomplete knowledge graphs has been extensively studied, but research on numerical knowledge graphs remains relatively limited. Recent approaches focus on separately encoding entities and numerical values, using neural networks to process query encodings for reasoning. However, in complex multi-hop reasoning tasks, numerical values are not merely symbols; they carry specific semantics and logical relationships that must be accurately represented. Directly encoding numerical values often leads to the loss of such semantic information. In this work, we propose the Complex Numerical Reasoning with Numerical Semantic Pre-training framework (CNR-NST). Specifically, we design a joint link predictor to learn numerical semantics. The proposed framework is the first to enable binary operations on numerical attributes in numerical knowledge graphs, allowing new numerical attributes to be inferred from existing knowledge. Our approach effectively handles up to 102 types of complex numerical reasoning queries. On three public datasets, CNR-NST demonstrates SOTA performance in complex numerical queries, achieving an average improvement of over 40% compared to existing methods. Notably, this work expands the range of query types for complex multi-hop numerical reasoning and introduces a new evaluation metric for numerical answers, which has been validated through comprehensive experiments.

Complex Numerical Reasoning with Numerical Semantic Pre-training Framework

This paper introduces essential resources for Qur'anic studies: an annotated Tafsir ontology, a dataset of approximately 4,200 question-answer pairs, and a collection of 15 structured Tafsir books available in two formats. We present a comprehensive framework for handling sensitive Qur'anic Tafsir data that spans the entire pipeline from dataset construction through evaluation and error analysis. Our work establishes new benchmarks for retrieval and question-answering tasks on Qur'anic content, comparing performance across state-of-the-art embedding models and large language models (LLMs). We introduce OntologyRAG-Q, a novel retrieval-augmented generation approach featuring our custom Ayat-Ontology chunking method that segments Tafsir content at the verse level using ontology-driven structure. Benchmarking reveals strong performance across various LLMs, with GPT-4 achieving the highest results, followed closely by ALLaM. Expert evaluations show our system achieves 69.52% accuracy and 74.36% correctness overall, though multi-hop and context-dependent questions remain challenging. Our analysis demonstrates that answer position within documents significantly impacts retrieval performance, and among eleven evaluation metrics tested, BERT-recall and BERT-F1 correlate most strongly with expert assessments. All resources developed in this study will be publicly available at https://github.com/OntologyRAG-Q/OntologyRAG-Q.git.
