China

Existing assessments of the planning capabilities of large language models (LLMs) remain largely limited to a single language or a specific representation format. To address this gap, we introduce the Multi-Plan benchmark, comprising 204 multilingual and multi-format travel planning scenarios. In experiments with state-of-the-art LLMs, the Multi-Plan benchmark effectively highlights the performance disparities among models, notably showing superior results for reasoning-specialized models. Interestingly, language differences had minimal impact, whereas mathematically structured representations significantly improved planning accuracy for most models, underscoring the crucial role of the input format. These findings enhance our understanding of the planning abilities of LLMs, offer valuable insights for future research, and emphasize the need for more sophisticated AI evaluation methods. The dataset is publicly available at https://github.com.

EMNLP 2025

Can LLMs Truly Plan? A Comprehensive Evaluation of Planning Capabilities

planning

reasoning

benchmark

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Document alignment, which aligns documents across source and target languages within the same web domain, is a necessary step in hierarchical mining. Several high-precision sentence embedding-based methods have been developed, such as TK-PERT and Optimal Transport (OT). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional Maxsim score (BiMax) for computing doc-to-doc similarity, to improve efficiency compared to the OT method. On the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximately 100-fold speed increase. We also conduct a comprehensive analysis of the performance of current state-of-the-art multilingual sentence embedding models.

BiMax: Bidirectional MaxSim Score for Document-Level Alignment
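The idea of a bidirectional MaxSim score can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each document is a list of sentence embeddings, that each direction takes the mean of the best cosine match per sentence, and that the two directions are combined by a simple average. The function names `cosine`, `maxsim`, and `bimax` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim(src, tgt):
    """One direction: mean over src sentences of the best cosine match in tgt."""
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)

def bimax(src, tgt):
    """Assumed combination: average of the src->tgt and tgt->src MaxSim scores."""
    return 0.5 * (maxsim(src, tgt) + maxsim(tgt, src))

# Toy 2-dimensional "sentence embeddings" (illustrative only).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_c = [[-1.0, 0.0]]

print(bimax(doc_a, doc_b))  # close match, score near 1
print(bimax(doc_a, doc_c))  # poor match, much lower score
```

Each direction needs only a row-wise or column-wise maximum over the sentence similarity matrix, which is cheaper than solving a transport problem over the same matrix; this is one plausible reading of where the reported speed-up comes from.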

Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Ge‘ez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Ge‘ez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer dataset will be publicly available under the Open Data licenses to support further research in low-resource, morphologically rich languages.

MoVoC: Morphology-Aware Subword Construction for Ge’ez Script Languages

Modern language models are evaluated on large benchmarks. Given how many different numbers these evaluations output, making sense of them for model selection can be difficult. We take a closer look at this using a model-centric lens and look at the evaluation numbers themselves. In this work, we analyze benchmarks in three stages: dataset & model comparison, representative set identification, and performance prediction. Since datasets and models relate strongly to one another, we develop an algorithm to identify a representative set of datasets that covers a benchmark using the raw evaluation scores alone. Using our algorithm, we find that with 5.9% (1/17), 1.7% (1/58), and 16.2% (12/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95%. Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near-zero mean-squared error. Taken together, our analysis can help model developers improve efficiency and allow dataset creators to validate whether their newly created dataset differs from existing datasets in the benchmark.

SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
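Representative set identification from a performance matrix can be sketched as a greedy coverage search. The abstract does not specify the algorithm, so everything here is an assumption: dataset similarity is taken as the absolute Pearson correlation of per-model scores, coverage is the mean over datasets of the best similarity to any selected dataset, and `representative_set` and the toy scores are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def representative_set(scores, threshold=0.95):
    """Greedily pick datasets until the assumed coverage measure reaches threshold.

    scores maps dataset name -> list of per-model scores (same model order).
    """
    names = list(scores)
    sim = {a: {b: abs(pearson(scores[a], scores[b])) for b in names} for a in names}

    def coverage(selected):
        # Mean over all datasets of the best similarity to a selected dataset.
        return sum(max(sim[d][s] for s in selected) for d in names) / len(names)

    selected = []
    while len(selected) < len(names):
        candidates = [n for n in names if n not in selected]
        best = max(candidates, key=lambda n: coverage(selected + [n]))
        selected.append(best)
        if coverage(selected) >= threshold:
            break
    return selected, coverage(selected)

# Toy performance matrix: four datasets scored on four models (illustrative only).
scores = {
    "qa":   [0.20, 0.50, 0.80, 0.90],
    "nli":  [0.25, 0.55, 0.82, 0.95],
    "math": [0.90, 0.10, 0.50, 0.30],
    "code": [0.85, 0.15, 0.45, 0.35],
}
selected, cov = representative_set(scores)
print(selected, round(cov, 3))
```

In this toy matrix the datasets fall into two strongly correlated pairs, so one dataset from each pair is enough to pass the 95% threshold under the assumed coverage measure.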

Text embeddings play an important role in NLP but are costly to store and use. Compressing embeddings addresses these challenges, but selecting the best compression methods remains difficult. Existing evaluation methods for compressed embeddings are either expensive or too simplistic. We introduce a new intrinsic evaluation framework with multiple task-agnostic metrics, including a novel spectral fidelity measure called EOS that is resilient to embedding anisotropy. We tested on a set of embeddings across four tasks. Our framework shows that intrinsic metrics reliably predict downstream performance and reveal how different models rely on local versus global structure. This provides a practical, efficient, and interpretable alternative to standard evaluations for compressed embeddings. (We will release the framework to the public. This will save researchers significant time.)

Do We Really Need All Those Dimensions? An Intrinsic Evaluation Framework for Compressed Embeddings

Table understanding is a crucial task in document processing and is commonly encountered in practical applications. We introduce 2Columns1Row, the first open-source benchmark for the table question answering task in Russian. This benchmark evaluates the ability of models to reason about the relationships between rows and columns in tables, employing both textual and multimodal inputs. 2Columns1Row consists of six datasets totaling 28,680 tables, designed to vary in the complexity of the text within the table contents and the consistency of the values in the cells. We evaluate the models using text-only and multimodal approaches and analyze their performance. Through extensive evaluation, we demonstrate the limitations of current multimodal models on this task and show the feasibility of a dynamic text-based system utilizing our benchmark. Our results highlight significant opportunities for advancing table understanding and reasoning, providing a solid foundation for future research in this domain.

2Columns1Row: A Russian Benchmark for Textual and Multimodal Table Understanding and Reasoning

Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy: different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring. We then adopt a two-stage fine-tuning strategy: first, System 1 tasks are trained with supervised fine-tuning (SFT) to enhance knowledge and intuition; next, System 2 tasks are refined with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose SampleMix, a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines’ performance, highlighting the substantial potential of SampleMix to optimize pre-training data. The code and data are available at https://anonymous.4open.science/r/SampleMix-910C, and can also be found in the Software part of the ARR page.

SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
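The bottom-up idea, where the domain distribution follows from per-sample weights rather than being fixed first, can be shown with a small sketch. The abstract does not give the weighting formula, so the linear blend controlled by `alpha`, the score values, and the helper names `sample_weights` and `implied_domain_distribution` are all hypothetical.

```python
from collections import defaultdict

def sample_weights(samples, alpha=0.5):
    """Assumed form: blend each sample's diversity and quality into one weight."""
    return [alpha * s["diversity"] + (1 - alpha) * s["quality"] for s in samples]

def implied_domain_distribution(samples, weights):
    """Sum per-sample weights by domain and normalize to a distribution."""
    totals = defaultdict(float)
    for sample, weight in zip(samples, weights):
        totals[sample["domain"]] += weight
    normalizer = sum(totals.values())
    return {domain: total / normalizer for domain, total in totals.items()}

# Toy corpus with made-up quality and diversity scores in [0, 1].
samples = [
    {"domain": "web",   "quality": 0.2, "diversity": 0.3},
    {"domain": "web",   "quality": 0.4, "diversity": 0.1},
    {"domain": "code",  "quality": 0.9, "diversity": 0.7},
    {"domain": "books", "quality": 0.8, "diversity": 0.6},
]
weights = sample_weights(samples)
dist = implied_domain_distribution(samples, weights)
print(dist)
```

The per-sample weights could then drive sampling directly, for example through the `weights` argument of Python's `random.choices`, so that no domain weight is ever set by hand.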

Test-time scaling large language models (LLMs), such as DeepSeek-R1 and OpenAI's o1, enhance reasoning by extending inference-time chain-of-thought traces. However, their legal reasoning capabilities remain underexplored. We conduct the first systematic evaluation of 10 LLMs, including both reasoning and general-purpose models, across 17 Chinese and English legal benchmarks covering statutory and case-law traditions. To bridge the domain gap, we curate a chain-of-thought-annotated legal corpus and train Legal-R1-14B, an open-source legal specialist model. Legal-R1-14B outperforms both o1-preview and DeepSeek-R1 on several benchmarks, establishing a new baseline for legal reasoning. Error analysis reveals ongoing challenges such as outdated legal knowledge, reasoning failures, and factual hallucinations, highlighting key directions for future work in legal-domain LLMs.

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

A conflict of interest (COI) appears when a person or a company has two or more interests that may directly conflict. This happens, for instance, when a scientist whose research is funded by a company audits the same company. For transparency and to avoid undue influence, public repositories of relations of interest are becoming recommended or mandatory in various areas, and can be used to avoid COIs. In this work, we propose an LLM-based open information extraction (OpenIE) framework for extracting financial or other types of interesting relations from scientific text. We target scientific publications in which authors declared funding sources or collaborations in the acknowledgment section, or in the metadata, or in the publication, following editors’ requirements. We propose an extraction methodology, an evaluation methodology, and a taxonomy of relations. Finally, we perform a comparative study of disclosures in two journals in the field of toxicology and pharmacology.

The Search for Conflicts of Interest: Open Information Extraction in Scientific Publications

In specialized domains such as space science and utilization, question answering (QA) systems are required to perform complex multi-fact reasoning over sparse knowledge graphs (KGs). Existing KG-based retrieval-augmented generation (RAG) frameworks often face challenges such as inefficient subgraph retrieval, limited reasoning capabilities, and high computational costs. These issues limit their effectiveness in specialized domains. In this paper, we propose SKRAG, a novel Skeleton-guided RAG framework for knowledge graph question answering (KGQA). SKRAG leverages a lightweight language model enhanced with the Finite State Machine (FSM) constraint to produce structurally grounded reasoning skeletons, which guide accurate subgraph retrieval. The retrieved subgraph is then used to prompt a general large language model (LLM) for answer generation. We also introduce SSUQA, a KGQA dataset in the space science and utilization domain. Experiments show that SKRAG outperforms strong baselines on SSUQA and two general-domain benchmarks, demonstrating its adaptability and practical effectiveness.
