EMNLP 2025

November 05, 2025

Suzhou, China


Test-time scaling enhances the reasoning of large language models (LLMs) such as DeepSeek-R1 and OpenAI's o1 by extending their inference-time chain-of-thought traces. However, the legal reasoning capabilities of these models remain underexplored. We conduct the first systematic evaluation of 10 LLMs, including both reasoning and general-purpose models, across 17 Chinese and English legal benchmarks covering statutory and case-law traditions. To bridge the domain gap, we curate a chain-of-thought-annotated legal corpus and train Legal-R1-14B, an open-source legal specialist model. Legal-R1-14B outperforms both o1-preview and DeepSeek-R1 on several benchmarks, establishing a new baseline for legal reasoning. Error analysis reveals persistent challenges, including outdated legal knowledge, reasoning failures, and factual hallucinations, pointing to key directions for future work on legal-domain LLMs.


