EMNLP 2025

November 05, 2025

Suzhou, China


Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley–Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel reward model framework for RL-based Text-to-SQL named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing time cost and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and readability of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.
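To make the two ideas in the abstract concrete, here is a minimal sketch (not the authors' code) of an execution-free, graph-based outcome reward. The paper's GMNScore relies on a learned graph matching network over SQL graph representations; the sketch below substitutes a simple node-label and edge-label overlap so it runs standalone. The use of sqlglot, the dialect choice, and all function names are illustrative assumptions.

```python
# Hedged sketch: graph-based, execution-free SQL reward.
# A learned GMN would replace `overlap`; this stand-in only
# illustrates the interface of an outcome reward in [0, 1].
from collections import Counter

import sqlglot
from sqlglot.errors import ParseError
from sqlglot.expressions import Expression


def sql_to_graph(sql: str):
    """Return multisets of AST node labels and parent->child edge labels."""
    root = sqlglot.parse_one(sql, read="sqlite")
    nodes, edges = Counter(), Counter()

    def visit(node: Expression):
        label = type(node).__name__
        nodes[label] += 1
        for child in node.args.values():
            children = child if isinstance(child, list) else [child]
            for c in children:
                if isinstance(c, Expression):
                    edges[(label, type(c).__name__)] += 1
                    visit(c)

    visit(root)
    return nodes, edges


def overlap(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def graph_reward(pred_sql: str, gold_sql: str) -> float:
    """Execution-free outcome reward; no database call is needed."""
    try:
        pn, pe = sql_to_graph(pred_sql)
        gn, ge = sql_to_graph(gold_sql)
    except ParseError:
        return 0.0  # an unparsable prediction gets the minimum reward
    return 0.5 * overlap(pn, gn) + 0.5 * overlap(pe, ge)


print(graph_reward("SELECT name FROM users WHERE age > 30",
                   "SELECT name FROM users WHERE age >= 30"))
```

A similarly hedged sketch of a stepwise reward over CTE subqueries, in the spirit of StepRTM: each WITH subquery of the gold SQL is treated as an intermediate step, aligned to the prediction's CTEs by alias (the alignment rule here is an assumption, not the paper's), and scored with any per-step scorer such as graph_reward above.

```python
# Hedged sketch: stepwise reward over CTE subqueries (illustrative only).
import sqlglot
from sqlglot import expressions as exp


def cte_steps(sql: str) -> dict:
    """Map each CTE alias to the canonical SQL of its subquery."""
    root = sqlglot.parse_one(sql, read="sqlite")
    return {cte.alias: cte.this.sql() for cte in root.find_all(exp.CTE)}


def stepwise_reward(pred_sql: str, gold_sql: str, step_score) -> float:
    """Average a per-step score over the gold CTE steps.

    Unmatched gold steps contribute 0, so a partially correct
    decomposition still earns partial credit."""
    pred, gold = cte_steps(pred_sql), cte_steps(gold_sql)
    if not gold:
        return step_score(pred_sql, gold_sql)  # no CTEs: plain outcome reward
    scores = [step_score(pred[a], q) if a in pred else 0.0
              for a, q in gold.items()]
    return sum(scores) / len(scores)
```

In an RL pipeline, a reward like graph_reward sidesteps the repeated database calls behind the latency issue noted above, while the stepwise variant adds intermediate supervision that favors readable, CTE-structured decompositions.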
