China

Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks through chain-of-thought (CoT) reasoning. However, they suffer from high inference latency due to lengthy reasoning chains. In this paper, we propose SpecCoT, a collaborative framework that combines large and small models for effective yet efficient reasoning. Unlike traditional speculative decoding, which operates at the token level, SpecCoT adopts a step-level verification strategy: the large model first establishes the reasoning direction, and for each intermediate step, the small model generates multiple candidate drafts in parallel. The large model then verifies these drafts, either selecting the most suitable one or rejecting them all and generating its own. The SpecCoT approach balances reasoning quality with inference efficiency through fine-grained model cooperation. Experiments across diverse tasks show that SpecCoT reduces inference latency by 1.7-4× while maintaining accuracy comparable to standard large model inference.
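The step-level loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `plan`, `draft`, `verify`, and `generate` are hypothetical stand-ins for calls to the large and small models, the `"ANSWER:"` stop marker is invented for the example, and drafts are produced sequentially here rather than in parallel.

```python
from typing import Callable, List, Optional, Tuple

Steps = List[str]


def spec_cot(
    question: str,
    plan: Callable[[str], str],
    draft: Callable[[str, Steps, int], str],
    verify: Callable[[str, Steps, List[str]], Optional[int]],
    generate: Callable[[str, Steps], str],
    num_drafts: int = 3,
    max_steps: int = 8,
) -> Tuple[Steps, int]:
    """Return the reasoning steps and how many came from accepted drafts."""
    steps: Steps = [plan(question)]  # large model sets the reasoning direction
    accepted = 0
    for _ in range(max_steps):
        # Small model proposes candidates (sequential here, parallel in practice).
        drafts = [draft(question, steps, i) for i in range(num_drafts)]
        choice = verify(question, steps, drafts)  # large model: index or None
        if choice is None:
            step = generate(question, steps)  # all drafts rejected
        else:
            step = drafts[choice]
            accepted += 1
        steps.append(step)
        if step.startswith("ANSWER:"):  # hypothetical stop marker
            break
    return steps, accepted


# Deterministic toy stand-ins for the two models (assumptions for the demo).
TARGET = ["2+3=5", "5+4=9", "ANSWER: 9"]


def toy_plan(question: str) -> str:
    return "PLAN: add left to right"


def toy_draft(question: str, steps: Steps, i: int) -> str:
    k = len(steps) - 1  # index of the step being drafted
    return TARGET[k] if (i == 1 and k != 1) else f"wrong-{i}"


def toy_verify(question: str, steps: Steps, drafts: List[str]) -> Optional[int]:
    k = len(steps) - 1
    return drafts.index(TARGET[k]) if TARGET[k] in drafts else None


def toy_generate(question: str, steps: Steps) -> str:
    return TARGET[len(steps) - 1]


steps, accepted = spec_cot("2+3+4", toy_plan, toy_draft, toy_verify, toy_generate)
print(steps, accepted)
```

In the toy run, the second step has no acceptable draft, so the fallback generator supplies it. The other two steps are taken from drafts.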

EMNLP 2025

SpecCoT: Accelerating Chain-of-Thought Reasoning through Speculative Exploration

efficient reasoning

speculative decoding

chain-of-thought

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Molecular optimization (modifying a given molecule to improve desired properties) is a fundamental task in drug discovery. While LLMs hold the potential to solve this task using natural language to drive the optimization, straightforward prompting achieves limited accuracy. In this work, we propose AgentDrug, an agentic workflow that leverages LLMs in a structured refinement process to achieve significantly higher accuracy. AgentDrug defines a nested refinement loop: the inner loop uses feedback from cheminformatics toolkits to validate molecular structures, while the outer loop guides the LLM with generic feedback and a gradient-based objective to steer the molecule toward property improvement. We evaluate AgentDrug on benchmarks with both single- and multi-property optimization under loose and strict thresholds. Results demonstrate significant performance gains over previous methods. With Qwen-2.5-3B, AgentDrug improves accuracy by 20.7% (loose) and 16.8% (strict) on six single-property tasks, and by 7.0% (loose) and 5.3% (strict) on eight multi-property tasks. With the larger Qwen-2.5-7B model, AgentDrug further improves accuracy on the six single-property tasks by 28.9% (loose) and 29.0% (strict), and on the eight multi-property tasks by 14.9% (loose) and 13.2% (strict).
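The nested refinement loop might look something like the sketch below. Everything here is an assumption made for illustration: `propose` stands in for the LLM call, `balanced` is a toy parenthesis check standing in for a real cheminformatics validator, the score is a simple character count, and the outer feedback is a plain score difference rather than the gradient-based objective mentioned in the abstract.

```python
from typing import Callable, List, Optional


def nested_refine(
    molecule: str,
    propose: Callable[[str, str], str],
    validate: Callable[[str], Optional[str]],
    score: Callable[[str], float],
    threshold: float,
    max_outer: int = 5,
    max_inner: int = 3,
) -> Optional[str]:
    """Return the first valid candidate whose score gain meets the threshold."""
    feedback = ""
    base = score(molecule)
    for _ in range(max_outer):
        candidate = None
        for _ in range(max_inner):  # inner loop: structural validity
            attempt = propose(molecule, feedback)
            error = validate(attempt)
            if error is None:
                candidate = attempt
                break
            feedback = f"invalid structure: {error}"
        if candidate is None:
            continue  # no valid structure this round, try again
        gain = score(candidate) - base  # outer loop: property improvement
        if gain >= threshold:
            return candidate
        feedback = f"gain {gain:.1f} is below threshold {threshold:.1f}"
    return None


def balanced(smiles: str) -> Optional[str]:
    """Toy validator: report unbalanced parentheses, else None."""
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return "unmatched ')'"
    return None if depth == 0 else "unclosed '('"


# Scripted proposals replace a real LLM so the run is deterministic.
scripted = iter(["CC(O", "CCO", "CC(O)O"])
log: List[str] = []


def toy_propose(molecule: str, feedback: str) -> str:
    log.append(feedback)  # record what feedback the "LLM" received
    return next(scripted)


result = nested_refine(
    "CC", toy_propose, balanced, lambda s: float(s.count("O")), threshold=2.0
)
print(result, log)
```

The log shows the two kinds of feedback in order: a structural error from the inner loop, then a property shortfall from the outer loop.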

AgentDrug: Utilizing Large Language Models in an Agentic Workflow for Zero-Shot Molecular Optimization

While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable document structure that could enhance knowledge acquisition and utilization. Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR²), a novel framework that explicitly incorporates document structure throughout the RAG process. RDR² employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic behavior curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on three challenging datasets, RDR² achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems' ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.

Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) at inference, utilizing techniques such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). In this paper, we introduce Reward-Guided Test-Time Computing (RTTC), a novel system designed to enhance the downstream accuracy of LLMs on client devices by applying test-time compute against a remote multi-domain knowledge base in a server-client architecture. At test time, RTTC retrieves relevant samples from the server and performs retrieval-augmented generation or lightweight, adaptive fine-tuning on the client, thereby enhancing downstream performance across diverse domains. To further optimize efficiency, RTTC introduces a Query-State Caching (QSC) mechanism that leverages historical query embeddings, significantly reducing redundant computation and latency. Furthermore, we explore the integration of reward models to dynamically select the optimal TTC strategy (no adaptation, RAG, or fine-tuning) while maximizing performance and maintaining computational efficiency. Extensive experiments demonstrate that RTTC achieves up to 7% improvement in average downstream accuracy and speedup over vanilla RAG or TTT across multiple LLMs and tasks. Our results highlight the potential of RTTC for scalable, high-performance language model adaptation on client devices.
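One plausible reading of a cache keyed on historical query embeddings is a nearest-neighbour lookup with a similarity threshold, sketched below. The abstract does not specify what state is cached or how matches are decided, so the class name `QueryStateCache`, the cosine-similarity rule, and the 0.9 threshold are all assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class QueryStateCache:
    """Reuse a stored state when a new query embedding is close to an old one."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], state: str) -> None:
        self._entries.append((list(embedding), state))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_state, best_sim = None, self.threshold
        for stored, state in self._entries:
            sim = cosine(embedding, stored)
            if sim >= best_sim:  # keep the most similar entry above threshold
                best_state, best_sim = state, sim
        return best_state


cache = QueryStateCache(threshold=0.9)
cache.put([1.0, 0.0], "state-a")
cache.put([0.0, 1.0], "state-b")
hit = cache.get([0.99, 0.1])   # close to the first stored query
miss = cache.get([1.0, 1.0])   # equally far from both, below threshold
print(hit, miss)
```

A miss would fall through to the full retrieval or adaptation path, after which the new embedding and state could be stored with `put`.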

RTTC: Reward-Guided Collaborative Test-Time Compute

We examine evaluation of faithfulness to input data in the context of hotel highlights—brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (r=0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
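For concreteness, a word-overlap score of the kind mentioned here can be as simple as the fraction of summary tokens that also occur in the source. The abstract does not define the exact metric used, so the function below (including its name, tokenization, and example strings) is an assumed minimal variant.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_overlap(summary: str, source: str) -> float:
    """Fraction of summary tokens that also appear in the source text."""
    summary_tokens = tokenize(summary)
    if not summary_tokens:
        return 0.0
    source_vocab = set(tokenize(source))
    hits = sum(token in source_vocab for token in summary_tokens)
    return hits / len(summary_tokens)


source = "The hotel offers free wifi and a rooftop pool."
print(word_overlap("Free wifi and pool", source))  # fully supported highlight
print(word_overlap("Free parking", source))        # "parking" is unsupported
```

A low score flags highlights containing words with no support in the input data, which is a crude proxy for the incorrect or non-checkable information discussed above.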

Real-World Summarization: When Evaluation Reaches Its Limits

Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computations. However, due to the high costs of training such models, brute-force trial-and-error approaches to improve LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also be used to efficiently uncover scaling laws that guide the building of LLMs to achieve a desirable performance with significantly better cost-effectiveness.

Uncovering Scaling Laws for Large Language Models via Inverse Problems

Tool usage is a proven technique for developing high-performance reasoning in large language models (LLMs). Our work emphasizes the utility of leveraging multiple diverse tools for complex reasoning tasks. We present Multi-TAG, a Multi-Tool AGgregation-based LLM framework that utilizes multiple diverse tools to solve complex math problems over multiple reasoning steps. At each reasoning step, Multi-TAG invokes multiple tools and accepts the step's solution from the tools that reach majority agreement on the final answer estimate. Multi-TAG strongly outperforms several standard baselines that use individual tools with the same number of runs, highlighting the importance of multi-tool invocation for solving complex reasoning tasks. We also show that naive aggregation of multiple tools at each reasoning step leads to substantial improvements of up to 35% accuracy. Multi-TAG then further improves these gains by 7.4% on average on MATH500, AIME, AMC, and OlympiadBench.
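The per-step majority agreement can be illustrated with a plain vote over the tools' final answer estimates. This is a sketch of naive majority voting only: the function name, the `(step, estimate)` pair format, and the tie-breaking rule (first tool seen wins) are assumptions, and the framework itself may aggregate differently.

```python
from collections import Counter
from typing import List, Tuple


def aggregate_step(tool_outputs: List[Tuple[str, str]]) -> Tuple[str, str, int]:
    """Pick a step from the tools whose answer estimate has the most votes.

    Each item is (candidate_step_text, final_answer_estimate).
    Returns (chosen_step, majority_answer, vote_count).
    """
    if not tool_outputs:
        raise ValueError("at least one tool output is required")
    votes = Counter(estimate for _, estimate in tool_outputs)
    answer, count = votes.most_common(1)[0]
    # Take the first step proposed by a tool in the majority group.
    chosen = next(step for step, estimate in tool_outputs if estimate == answer)
    return chosen, answer, count


outputs = [
    ("solve symbolically", "42"),
    ("run a numeric check", "41"),
    ("reason in natural language", "42"),
]
print(aggregate_step(outputs))
```

Repeating this at every reasoning step, with the chosen step appended to the shared context, gives the multi-step flow the abstract describes.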

Diverse Multi-tool Aggregation with Large Language Models for Enhanced Math Reasoning

Large Reasoning Models (LRMs) have recently gained attention for their strong reasoning capabilities in solving complex problems. Reflective reasoning (revisiting and refining prior thoughts) has been shown to significantly enhance performance. However, excessive reflection can lead to longer outputs, increased inference time and cost, and a higher risk of hallucination. Existing training methods rarely address this trade-off effectively. We propose ReFLAIR, a unified framework for reflective reasoning guided by structured trajectories and hybrid rewards. First, we construct a high-quality ReFLAIR-cold dataset comprising 30K diverse multimodal reasoning samples annotated with introspective revisions. Then, we train a Reflection Quality Scorer (RQS) to assess the value added by each re-thinking step. Finally, we adopt a modified GRPO-based reinforcement learning approach, which combines answer accuracy, structural compliance, reflection utility, and task difficulty into a hybrid reward for policy optimization. ReFLAIR achieves up to +12.2% accuracy improvement on challenging benchmarks such as MathVista and MM-Math, while consistently enhancing reflection quality and reasoning robustness. Our findings demonstrate that structured reflection offers a scalable and generalizable pathway to improving step-by-step reasoning in multimodal LLMs.

ReFLAIR: Enhancing Multimodal Reasoning via Structured Reflection and Reward-Guided Learning

Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We constructed and published two new textual entailment datasets, NoFEVER-ML and NoSNLI-ML, in four languages (English, Czech, German, and Ukrainian) with paired examples differing in negation. These datasets allow investigation of the root causes of the negation problem and its exemplification: how the properties of popular LLMs and the language of the input affect their inability to handle negation correctly. Contrary to previous work, we show that increasing the model size may improve the models' ability to handle negations. Furthermore, we find that both the models' reasoning accuracy and robustness to negation are language-dependent and that the length and explicitness of the premise have an impact on robustness. Accuracy is better in projective languages with fixed word order, such as English, than in non-projective ones, such as German or Czech. Our entailment datasets pave the way to further research for explanation and exemplification of the negation problem, minimization of LLM hallucinations, and improvement of LLM reasoning in multilingual settings.

Towards the Roots of the Negation Problem: A Multilingual NLI Dataset and Model Scaling Analysis

Inference constitutes the majority of costs throughout the lifecycle of a large language model (LLM). While numerous LLM inference engines focusing primarily on low-level optimizations have been developed, there is a scarcity of non-intrusive client-side frameworks that perform high-level optimizations. In this paper, we introduce CacheSaver, a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, thereby integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. The key novelty is a namespace-aware list-valued cache that ensures statistical integrity of LLM responses by generating independent and identically distributed responses within a namespace as well as ensuring reproducibility. Moreover, as a direct consequence of operating at a high level, CacheSaver supports both local and online models. We conduct extensive experiments with five representative state-of-the-art reasoning strategies, five diverse benchmark tasks, and three different LLMs. On average across all methods, tasks, and LLMs, CacheSaver reduces cost by approximately 25% and CO2 emissions by approximately 35%. Notably, CacheSaver excels in practical machine learning scenarios such as benchmarking across multiple methods or conducting ablation analysis of a specific method, obtaining substantial cost and carbon footprint reduction of approximately 60%. CacheSaver is publicly available at https://anonymous.4open.science/r/cachesaver-7060.
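One way to read the namespace-aware list-valued cache is sketched below: each prompt maps to a list of sampled responses, and each namespace keeps its own cursor into that list. Repeated requests inside one namespace therefore never reuse a response, while a different namespace replays the same list from the start. This is an interpretation of the abstract rather than CacheSaver's actual API. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class ListValuedCache:
    """Per-prompt response lists with one read cursor per (namespace, prompt)."""

    def __init__(self, sample: Callable[[str], str]) -> None:
        self._sample = sample  # stand-in for a real LLM call
        self._responses: Dict[str, List[str]] = defaultdict(list)
        self._cursor: Dict[Tuple[str, str], int] = defaultdict(int)

    def request(self, namespace: str, prompt: str) -> str:
        idx = self._cursor[(namespace, prompt)]
        stored = self._responses[prompt]
        if idx == len(stored):
            # This namespace has consumed every cached response: sample anew.
            stored.append(self._sample(prompt))
        self._cursor[(namespace, prompt)] = idx + 1
        return stored[idx]


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Toy sampler that counts how often the model is actually invoked."""
    calls.append(prompt)
    return f"response-{len(calls)}"


cache = ListValuedCache(fake_llm)
a = [cache.request("A", "p") for _ in range(2)]  # two fresh samples
b = [cache.request("B", "p") for _ in range(3)]  # two replayed, one fresh
print(a, b, len(calls))
```

Five requests cost only three model calls in this toy run, and replaying a namespace's sequence from a persisted cache is what would make reruns reproducible.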
