China

Large-scale generative models like DeepSeek-R1 and OpenAI o1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy: different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. We therefore propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring. We then adopt a two-stage fine-tuning strategy: System 1 tasks are first trained with supervised fine-tuning (SFT) to enhance knowledge and intuition, and System 2 tasks are then refined with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
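The two partitioning steps named in the abstract (voting over per-model labels, then selecting parameters by importance score) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names `majority_vote` and `top_fraction_mask`, the tie-breaking rule, the top-fraction selection criterion, and the toy votes and scores are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most voters.

    Assumption: ties are routed to "system2" (treated as needing deliberation).
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "system2"
    return ranked[0][0]


def top_fraction_mask(scores, fraction):
    """Mark the highest-scoring `fraction` of parameters as active (True)."""
    k = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


# Hypothetical votes from three role-playing models per question.
votes = {
    "What is the capital of France?": ["system1", "system1", "system2"],
    "Prove the sum of two odd numbers is even.": ["system2", "system2", "system1"],
}
routed = {question: majority_vote(v) for question, v in votes.items()}

# Hypothetical importance scores of five LoRA parameters for System 1 data.
sys1_scores = [0.9, 0.1, 0.4, 0.05, 0.7]
sys1_mask = top_fraction_mask(sys1_scores, fraction=0.4)
```

In this toy setup, only the parameters flagged in `sys1_mask` would be updated during the SFT stage; a second mask computed from System 2 importance scores would play the same role for the RL stage.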

EMNLP 2025

LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

model partitioning

system 1/2

dual-system learning

lora-par

supervised fine-tuning (sft)

large language models (llms)

peft

reinforcement learning (rl)

efficient training

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose SampleMix, a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Moreover, SampleMix requires 1.4x to 2.1x fewer training steps to reach the baselines’ performance, highlighting its substantial potential to optimize pre-training data. The code and data are available at https://anonymous.4open.science/r/SampleMix-910C, and can also be found in the Software part of the ARR page.

SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
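The bottom-up idea in the SampleMix abstract, where per-sample scores determine sampling and the domain distribution falls out as a consequence, can be sketched as follows. This is only an illustration under stated assumptions: the linear blend with weight `alpha`, the function names, and the toy quality and diversity scores are hypothetical and do not reproduce the paper's actual scoring or sampling procedure.

```python
from collections import defaultdict


def sample_weights(samples, alpha=0.5):
    """Blend per-sample diversity and quality into normalized sampling probabilities.

    Assumption: a simple linear blend, alpha * diversity + (1 - alpha) * quality.
    """
    raw = [alpha * s["diversity"] + (1 - alpha) * s["quality"] for s in samples]
    total = sum(raw)
    return [w / total for w in raw]


def implied_domain_shares(samples, probs):
    """Sum per-sample probabilities by domain: the domain mix is an output, not an input."""
    shares = defaultdict(float)
    for s, p in zip(samples, probs):
        shares[s["domain"]] += p
    return dict(shares)


# Hypothetical corpus of four samples with scores in [0, 1].
samples = [
    {"domain": "web", "quality": 0.2, "diversity": 0.4},
    {"domain": "web", "quality": 0.6, "diversity": 0.2},
    {"domain": "code", "quality": 0.8, "diversity": 0.6},
    {"domain": "books", "quality": 1.0, "diversity": 0.2},
]
probs = sample_weights(samples, alpha=0.5)
shares = implied_domain_shares(samples, probs)
```

The contrast with a domain-wise method is that no domain weight is fixed in advance here: `shares` is derived entirely from the individual sample scores.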

Test-time scaling large language models (LLMs), such as DeepSeek-R1 and OpenAI's o1, enhance reasoning by extending inference-time chain-of-thought traces. However, their legal reasoning capabilities remain underexplored. We conduct the first systematic evaluation of 10 LLMs, including both reasoning and general-purpose models, across 17 Chinese and English legal benchmarks covering statutory and case-law traditions. To bridge the domain gap, we curate a chain-of-thought-annotated legal corpus and train Legal-R1-14B, an open-source legal specialist model. Legal-R1-14B outperforms both o1-preview and DeepSeek-R1 on several benchmarks, establishing a new baseline for legal reasoning. Error analysis reveals ongoing challenges such as outdated legal knowledge, reasoning failures, and factual hallucinations, highlighting key directions for future work in legal-domain LLMs.

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

A conflict of interest (COI) appears when a person or a company has two or more interests that may directly conflict. This happens, for instance, when a scientist whose research is funded by a company audits the same company. For transparency and to avoid undue influence, public repositories of relations of interest are becoming recommended or mandatory in various areas, and can be used to avoid COIs. In this work, we propose an LLM-based open information extraction (OpenIE) framework for extracting financial or other types of interesting relations from scientific text. We target scientific publications in which authors declared funding sources or collaborations in the acknowledgment section, or in the metadata, or in the publication, following editors’ requirements. We propose an extraction methodology, an evaluation methodology, and a taxonomy of relations. Finally, we perform a comparative study of disclosures in two journals in the field of toxicology and pharmacology.

The Search for Conflicts of Interest: Open Information Extraction in Scientific Publications

In specialized domains such as space science and utilization, question answering (QA) systems are required to perform complex multi-fact reasoning over sparse knowledge graphs (KGs). Existing KG-based retrieval-augmented generation (RAG) frameworks often face challenges such as inefficient subgraph retrieval, limited reasoning capabilities, and high computational costs. These issues limit their effectiveness in specialized domains. In this paper, we propose SKRAG, a novel Skeleton-guided RAG framework for knowledge graph question answering (KGQA). SKRAG leverages a lightweight language model enhanced with the Finite State Machine (FSM) constraint to produce structurally grounded reasoning skeletons, which guide accurate subgraph retrieval. The retrieved subgraph is then used to prompt a general large language model (LLM) for answer generation. We also introduce SSUQA, a KGQA dataset in the space science and utilization domain. Experiments show that SKRAG outperforms strong baselines on SSUQA and two general-domain benchmarks, demonstrating its adaptability and practical effectiveness.

SKRAG: A Retrieval-Augmented Generation Framework Guided by Reasoning Skeletons over Knowledge Graphs

Multi-agent systems (MAS) powered by large language models (LLMs) have shown potential in tackling multifaceted problems through advanced understanding and reasoning. However, they struggle to adapt to evolving task dependencies and to handle uncertainties, such as shifting priorities or unpredictable disruptions. These constraints undermine their ability to dynamically adjust long-term strategies and inter-agent collaboration. To address these challenges, we propose DeMAC, a Dynamic Environment-Aware Manager-Player Agents Coordination framework that enhances multi-agent coordination through long-term strategic planning. DeMAC uses a dynamically updated directed acyclic graph (DAG) and a Manager-Player Dual-Feedback mechanism to align strategic and operational decisions. Moreover, DeMAC enables agents to maintain collaboration and dynamically adapt to changing environmental conditions, outperforming traditional reinforcement learning and human-agent collaboration in the Overcooked simulation. Experimental results highlight DeMAC’s ability to tackle complex coordination tasks, demonstrating its potential to advance LLM-based MAS in dynamic, complex task dependency environments.

DeMAC: Enhancing Multi-Agent Coordination with Dynamic DAG and Manager-Player Feedback

Dialects exhibit a substantial degree of lexical variation due to the lack of a standard orthography. At the same time, Large Language Models’ (LLMs) ability to process dialects remains largely understudied. To address this gap, we conduct a fine-grained analysis of dialect variation across different parts-of-speech. Using Bavarian as a case study, we investigate the lexical dialect understanding capability of LLMs by examining how they recognize and translate dialectal terms. To this end, we introduce DiaLemma, a novel annotation framework for obtaining dialect variation dictionaries from monolingual data only, and use it to create a ground truth dataset of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can recognize Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our evaluation reveals that LLMs are better at translating and recognizing nouns. Surprisingly, when LLMs are used as dialect word translation models, we find that providing additional context in the form of example usages can boost their performance. Our results highlight the limitations of LLMs in dealing with orthographic dialect variation and emphasize the need for future work on adapting LLMs to dialects.

Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora

As machine learning (ML) applications continue to expand across diverse fields, there is a rising demand for ML code generation. In this paper, we address a critical research question: Can machines autonomously generate ML code for sophisticated, human-designed algorithms or solutions? To answer this question, we introduce a novel benchmark, MLAlgo-Bench, which includes two challenging tasks: 1) generating code for ML algorithms, including both traditional ML and modern deep learning-based methods, and 2) given human solution sketches, writing ML code for solving practical tasks in Kaggle competitions. This benchmark is unique in its focus on the challenges of interpreting intricate human instructions and producing multi-step, high-complexity code, offering a rigorous test for current Large Language Model (LLM) capabilities. We introduce an automatic evaluation framework with comprehensive metrics such as task pass rate, relative performance metric, and time overhead. Currently, the top-performing model (Claude3.5-Sonnet) achieves a 48.8% task completion rate on realizing machine learning algorithms, and a 21.6% rate for completing Kaggle competitions. Further analysis suggests substantial room for improvement.

MLAlgo-Bench: Can Machines Implement Machine Learning Algorithms?

Minecraft, as an open-world virtual interactive environment, has become a prominent platform for research on agent decision-making and execution. Existing work primarily adopts a single Large Language Model (LLM) agent to complete various in-game tasks. However, for complex tasks requiring lengthy sequences of actions, single-agent approaches often face challenges related to inefficiency and limited fault tolerance. Despite these issues, research on multi-agent collaboration remains scarce. In this paper, we propose CausalMACE, a holistic causality planning framework designed to enhance multi-agent systems, in which we incorporate causality to manage dependencies among subtasks. Technically, our proposed framework introduces two modules: an overarching task graph for global task planning and a causality-based module for dependency management, where inherent rules are adopted to perform causal intervention. Experimental results demonstrate that our approach achieves state-of-the-art performance in multi-agent cooperative tasks in Minecraft. The code will be open-sourced upon the acceptance of this paper.

CausalMACE: Causality Empowered Multi-Agents in Minecraft Cooperative Tasks

LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors, especially batch size, remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during backpropagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.

Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
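The unbiasedness property of stochastic rounding mentioned in the abstract above can be seen in a small sketch. It rounds a value up with probability equal to its fractional remainder, so the expected result equals the input, and averaging more independently rounded values brings the mean closer to it. This is only a toy illustration of that averaging intuition, not the paper's analysis; the function names, the step size, and the sample counts are assumptions chosen for the example.

```python
import math
import random


def stochastic_round(x, step=1.0, rng=random):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional remainder, so that the expected output equals x."""
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step


def mean_of_rounded(x, n, step=1.0, seed=0):
    """Average n independently stochastically rounded copies of x."""
    rng = random.Random(seed)
    return sum(stochastic_round(x, step, rng) for _ in range(n)) / n


x = 0.3
nearest = round(x)  # deterministic rounding always returns 0, a fixed bias of -0.3
small_batch = mean_of_rounded(x, n=8)      # noisy estimate from few samples
large_batch = mean_of_rounded(x, n=20000)  # averaging many samples approaches x
```

Each individual rounded value is still coarse (0 or 1 here); it is only the average over many of them that recovers the original value, which is the sense in which a larger batch can offset lower precision in this toy setting.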

Large language models (LLMs) have achieved remarkable success in generative tasks, yet they often fall short in ensuring the factual accuracy of their outputs, thus limiting their reliability in real-world applications where correctness is critical. In this paper, we present FactReasoner, a novel neuro-symbolic factuality assessment framework that employs probabilistic reasoning to evaluate the truthfulness of long-form generated responses. FactReasoner decomposes a response into atomic units, retrieves relevant contextual information from external knowledge sources, and models the logical relationships (e.g., entailment, contradiction) between these units and their contexts using probabilistic encodings. It then estimates the posterior probability that each atomic unit is supported by the retrieved evidence. Our experiments on both labeled and unlabeled benchmark datasets demonstrate that FactReasoner often outperforms state-of-the-art prompt-based methods in terms of factual precision and recall.
