
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.

EMNLP 2025

WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

llm/ai agents

chain-of-thought

applications


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large reasoning models (LRMs) excel at solving complex tasks by leveraging long chain-of-thought (CoT) reasoning. However, this often leads to overthinking on simple tasks, resulting in unnecessary computational overhead. We observe that LRMs inherently possess the capability for efficient short CoT reasoning, which can be reliably elicited through prompt design. To leverage this capability, we propose ThinkSwitcher, a framework that enables a single LRM to dynamically switch between short and long CoT modes based on task complexity. ThinkSwitcher introduces a lightweight switching module trained with supervision signals derived from the relative performance of each reasoning mode across tasks. Experiments on multiple reasoning benchmarks show that ThinkSwitcher reduces computational cost by 20-30% while maintaining high accuracy on complex tasks. This demonstrates the effectiveness of ThinkSwitcher as a scalable and efficient solution for unified LRM deployment.

ThinkSwitcher: When to Think Hard, When to Think Fast
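
The switching idea described in the ThinkSwitcher abstract can be illustrated with a toy router. This is a minimal sketch, not the authors' implementation: the abstract does not specify the switching module's architecture or decision rule, so `ModeEstimate`, `choose_mode`, `supervision_target`, and the `margin` threshold are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ModeEstimate:
    """Hypothetical predicted pass rates for one query (an assumption, not from the paper)."""
    short_pass_rate: float  # estimated accuracy with short CoT
    long_pass_rate: float   # estimated accuracy with long CoT


def choose_mode(est: ModeEstimate, margin: float = 0.1) -> str:
    """Pick long CoT only when it is expected to beat short CoT by more than `margin`."""
    gain = est.long_pass_rate - est.short_pass_rate
    return "long" if gain > margin else "short"


def supervision_target(short_correct: list[bool], long_correct: list[bool]) -> float:
    """Illustrative training signal: empirical pass-rate gap between the two modes."""
    short_rate = sum(short_correct) / len(short_correct)
    long_rate = sum(long_correct) / len(long_correct)
    return long_rate - short_rate
```

Under these assumptions, a query whose estimated gain from long CoT is small is routed to the cheaper short mode, which is where the compute savings would come from.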

In competitive programming tasks, problem statements are often embedded within elaborate narrative backgrounds, requiring a deep understanding of the underlying solutions to complete the tasks successfully. Current code generation models primarily focus on token-level semantic modeling and are highly susceptible to distraction from irrelevant narrative statements. Inspired by RAG, retrieving reference code with similar solutions may help enhance model performance on difficult problems. However, existing retrieval models also emphasize surface-level semantic similarity, neglecting the deeper solution-level logical similarities that are critical in competitive programming. Therefore, designing ranking models capable of accurately identifying and retrieving problems and their corresponding code remains an urgent research problem in competitive code generation. In this paper, we propose SolveRank, a solution-aware ranking model empowered by synthetic data for competitive programming tasks. Specifically, we leverage the DeepSeek-R1 model to generate logically equivalent but differently phrased new problems, verified by GPT-4o for solution consistency. Then, we train SolveRank with these as positive samples and BM25/random-retrieved problems as negatives. During inference, SolveRank retrieves relevant problems and corresponding code from the corpus to assist a downstream code generator. Experiments on the xCodeEval dataset demonstrate that SolveRank outperforms SOTA ranking methods in precision and recall metrics, and boosts code generation performance for difficult problems.

Beyond the Surface: A Solution-Aware Retrieval Model for Competition-level Code Generation

Recent advancements in general-domain Role-playing Agents (RPAs) have enabled agents to maintain character properties across a wide spectrum of daily tasks beyond mere scenario-based chit-chat. Nonetheless, current work neglects to replicate the internal properties of characters, such as fine-grained memories, and fails to align with the knowledge boundary of each character, resulting in degraded personalization and a proneness to character hallucination in the general domain. To address these problems, we draw inspiration from the context effect theory and propose TailorRPA, a retrieval-based framework that harvests tailored general-domain instructions to improve the integration of fine-grained memories and incorporates general-domain protective queries to help shape the character-wise knowledge boundary, alleviating character hallucination. Based on this framework, we develop TailorGen, a role-playing dataset comprising both role-specific and general-domain instructions. Through empirical experiments, we demonstrate the superiority of TailorRPA over baseline methods in eliciting general-domain role-playing capabilities and alleviating character hallucination. We also examine the existence of character hallucination in state-of-the-art proprietary models, underlining the importance of our work.

TailorRPA: A Retrieval-Based Framework for Eliciting Personalized and Coherent Role-Playing Agents in General Domain

Recent advancements in Retrieval-Augmented Generation (RAG) have improved large language models (LLMs) by incorporating external knowledge at inference time. Graph-based RAG systems have emerged as promising approaches, enabling multi-hop reasoning by organizing retrieved information into structured graphs. However, when knowledge graphs are constructed from unstructured documents using LLMs, they often suffer from fragmentation—resulting in disconnected subgraphs that limit inferential coherence and undermine the advantages of graph-based retrieval. To address these limitations, we propose ReGraphRAG, a novel framework designed to reconstruct and enrich fragmented knowledge graphs through three core components: Graph Reorganization, Perspective Expansion, and Query-aware Reranking. Experiments on four benchmarks show that ReGraphRAG outperforms state-of-the-art baselines, achieving over 80% average diversity win rate. Ablation studies highlight the key contributions of graph reorganization and especially perspective expansion to performance gains. Our code is available at: https://anonymous.4open.science/r/ReGraphRAG-7B73

ReGraphRAG: Reorganizing Fragmented Knowledge Graphs for Multi-Perspective Retrieval-Augmented Generation
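
The fragmentation problem named in the ReGraphRAG abstract (disconnected subgraphs) can be made concrete by counting connected components. This is a minimal sketch on made-up data; it only shows how fragmentation could be detected and is not the paper's Graph Reorganization component. The function name and the toy graph are assumptions.

```python
from collections import defaultdict


def connected_components(edges, nodes):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from each unvisited node.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Toy entity graph (hypothetical data): two islands that share no edge.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(edges, nodes)
print(len(components))  # number of disconnected subgraphs
```

More than one component means that a multi-hop query starting in one island cannot reach facts stored in another, which is the limitation the abstract attributes to fragmented graphs.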

Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are confined to historical backtesting, where trading actions cannot influence market prices, and agents train on static data. To overcome this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive, multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables agents to train in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments show that LLMs struggle with numerical reasoning when given plain-text data, tending to overfit local patterns and recent values. In contrast, chart-based visualizations significantly boost both numerical reasoning and trading performance. Moreover, integrating a reflection module yields further improvements, especially with visual inputs. Finally, evaluations on the NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://anonymous.4open.science/r/Agent-Trading-Arena-8412.

Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents

Competitive programming problems, due to their high reasoning difficulty and precise correctness feedback, have become a key benchmark for evaluating the reasoning capabilities of large language models (LLMs), playing a pivotal role in both LLM evaluation and reinforcement learning (RL) training. However, while available public datasets gather problems from platforms like Codeforces and generate additional test cases, these test cases often fall short in quality compared to official ones, resulting in inaccurate evaluations. In this paper, we introduce an agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new dataset with improved test cases, CodeContests+. We evaluate the accuracy of both datasets using 1.72 million real-world submissions. Results show that CodeContests+ has a significantly higher evaluation accuracy than CodeContests and yields better performance in RL training.

CodeContests+: High-Quality Test Case Generation for Competitive Programming
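
One way to read "evaluation accuracy" in the CodeContests+ abstract is as agreement between a dataset's test cases and the official verdicts on real submissions. The sketch below computes true-positive and true-negative rates under that reading. The abstract does not define the metric, so this choice, the function name, and the sample verdicts are all assumptions.

```python
def evaluation_accuracy(records):
    """Compute true-positive and true-negative rates of a test suite.

    `records` is a list of (officially_accepted, passes_dataset_tests) booleans,
    one pair per submission.
    """
    tp = sum(1 for official, passed in records if official and passed)
    tn = sum(1 for official, passed in records if not official and not passed)
    positives = sum(1 for official, _ in records if official)
    negatives = len(records) - positives
    tpr = tp / positives if positives else float("nan")
    tnr = tn / negatives if negatives else float("nan")
    return tpr, tnr


# Hypothetical verdicts for five submissions (illustrative only).
records = [(True, True), (True, True), (False, False), (False, True), (False, False)]
tpr, tnr = evaluation_accuracy(records)
```

A low true-negative rate would mean that weak test cases let officially rejected submissions pass, which is the kind of inaccurate evaluation the abstract describes.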

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, although they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate that they perform poorly in two basic cases, "multi-matching retrieval" and "logic-based retrieval", which are beyond LCLMs' ability boundary. We find that these cases can be addressed well with a sufficient number of reasoning steps guided by specific CoT prompts, indicating the potential necessity of combining long-context tasks with CoT methods for more advanced long-context handling. However, purely CoT-based methods are too time-consuming when the context is very long, which means accurate and efficient long-context handling still has a long way to go.

Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps

The effectiveness of large language models (LLMs) in fact-checking misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools. Our dataset and code will be openly released. Data samples can be accessed at https://anonymous.4open.science/r/CANDY-E9F8.

CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods based on semantic similarity work well only on simplified hand-crafted datasets and struggle to handle complex, real-world scenarios with numerous and diverse columns. To address this, we propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Given a natural language query, our method searches this graph to construct interpretable reasoning chains, aided by pruning and sub-path merging strategies to enhance efficiency and coherence. Experiments on both standard benchmarks and a realistic, large-scale dataset demonstrate the effectiveness of our approach. To our knowledge, this is the first multi-table QA system applied to truly complex industrial tabular data.

Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance
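
The graph search described in the schema-graph abstract can be sketched as a shortest-path search over curated join links. This is a minimal illustration using plain breadth-first search. The paper's pruning and sub-path merging strategies are not reproduced, and the table names, join conditions, and `find_join_path` helper are hypothetical.

```python
from collections import deque


def find_join_path(schema_graph, source, target):
    """Breadth-first search for the shortest join path between two tables.

    `schema_graph` maps a table name to {neighbor_table: join_condition}.
    Returns the list of join conditions, or None if the tables are not linked.
    """
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        table, path = queue.popleft()
        if table == target:
            return path
        for neighbor, condition in schema_graph.get(table, {}).items():
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [condition]))
    return None


# Hypothetical three-table schema with curated join links.
schema_graph = {
    "orders": {"customers": "orders.customer_id = customers.id",
               "items": "orders.id = items.order_id"},
    "customers": {"orders": "orders.customer_id = customers.id"},
    "items": {"orders": "orders.id = items.order_id"},
}
path = find_join_path(schema_graph, "customers", "items")
```

Because each step in the returned path is an explicit join condition, the chain can be read directly as the reasoning that links the two tables, rather than being inferred from column-name similarity.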

The evaluation of mathematical reasoning capabilities constitutes a critical pathway toward achieving Artificial General Intelligence (AGI). Prevailing benchmarks including MATH and AIME mainly feature single-instantiation problems with fixed numbers, permitting pattern matching instead of principled deductive reasoning and leaving generalization on isomorphic problem variants untested. To address these limitations, we propose the UTMath Benchmark, employing rigorous unit testing methodology that simultaneously quantifies solution accuracy and solution space generality. It comprises 1,053 problems spanning 9 mathematical domains, each accompanied by an average of 68 varied test cases. With answer possibilities per problem on average, UTMath sets new standards for robust reasoning while preventing memorization. UTMath is highly challenging, with the best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at 27.16%, and GPT-4o at 26.93%. We further propose Reasoning-to-Code Thoughts (RCoT), a prompting strategy that decouples symbolic reasoning from code synthesis. RCoT guides LLMs to first derive formal reasoning structures before generating executable code, producing generalizable solutions rather than situation-specific answers. To help the community push mathematical reasoning further, we release UTMath-Train (70k samples), a companion training set generated under the same protocol. Our benchmark can be accessed via the following link: [UTMath](https://anonymous.4open.science/r/UTMath-3356)
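
The unit-testing idea in the UTMath abstract can be illustrated with a toy problem: a solution counts as correct only if it passes every varied test case, so a memorized answer to a few instances fails. This is a minimal sketch. The problem, the test cases, and the function names are made up for illustration and are not taken from the benchmark.

```python
def passes_all_cases(solution, test_cases):
    """A solution counts as correct only if every test case passes."""
    return all(solution(n) == expected for n, expected in test_cases)


# Hypothetical problem: the n-th triangular number, checked on varied inputs.
test_cases = [(1, 1), (2, 3), (10, 55), (1000, 500500)]


def general_solution(n):
    """Closed-form answer that works for any n."""
    return n * (n + 1) // 2


def memorized_solution(n):
    """Situation-specific answer that only covers a few seen instances."""
    return {1: 1, 2: 3, 10: 55}.get(n)
```

Here `general_solution` passes all four cases, while `memorized_solution` fails on the unseen input `1000`, which mirrors the abstract's contrast between generalizable solutions and situation-specific answers.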
