China

Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic, and code-verifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.
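As a rough illustration of what "code-verifiable" can mean for a string-matching metric, the sketch below computes a clipped unigram precision and checks a model-reported value against it. The abstract does not name the metrics MCBench uses, so the metric choice, the function names `clipped_unigram_precision` and `is_correct`, whitespace tokenization, and the tolerance `tol` are all assumptions made for illustration, not the benchmark's actual reference code.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference.

    Each candidate token is credited at most as many times as it appears
    in the reference (clipping). Tokenization is plain whitespace
    splitting, which is an assumption of this sketch.
    """
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(reference.split())
    # Counter returns 0 for tokens missing from the reference.
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / len(cand_tokens)


def is_correct(llm_answer: float, candidate: str, reference: str, tol: float = 1e-6) -> bool:
    """Deterministically compare a model-reported score with the reference code."""
    expected = clipped_unigram_precision(candidate, reference)
    return abs(llm_answer - expected) <= tol
```

Because the reference computation is deterministic, a model's final answer (and, in principle, its intermediate counts) can be checked without any human or LLM judge.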

EMNLP 2025

Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models

benchmark

large language model

evaluation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use, a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage by LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive **over 10 times more usage** from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.

Tool Preferences in Agentic LLMs are Unreliable

Recent advances in large language models (LLMs) have yielded impressive gains on mathematical reasoning benchmarks via supervised fine-tuning (SFT). However, the brittleness of these models under input perturbations has cast doubt on whether such improvements reflect genuine reasoning abilities or merely superficial alignment with expected output formats. We investigate the mechanisms behind SFT improvements in small-scale LLMs, addressing four key questions: (1) Are performance gains primarily due to format alignment rather than reasoning? (2) Can high-quality supervision encourage genuine reasoning? (3) Does scaling data shift learning from format alignment to deeper reasoning? (4) Are format alignment gains consistent across model sizes and architectures? Through controlled experiments, we find that most performance improvements arise from format alignment rather than genuine reasoning enhancement. Moreover, SFT's effectiveness is strongly influenced by the alignment between the base model’s inductive biases and the teacher model’s output distribution, rather than the teacher’s raw strength. Finally, scaling up training data offers diminishing returns and does not fundamentally alter the model’s reasoning behavior. These findings suggest that current SFT practices may overestimate the reasoning abilities of LLMs and underscore the need for more rigorous evaluation methods.

The Emperor's New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT

Many statistical facts are conveyed through charts. While various methods have emerged for chart understanding, chart generation typically requires users to manually input code, intent, and other parameters to obtain the desired format in chart generation tools. Recently, the advent of image-generating Large Language Models has facilitated chart generation; however, even this process often requires users to provide numerous constraints for accurate results. In this paper, we propose a loop-based framework for automatically evolving charts in a multi-agent environment. Within this framework, three distinct agents (Chart Code Generator, Chart Replier, and Chart Quality Evaluator) collaborate for iterative, user-tailored chart generation using large language models. Our approach demonstrates an improvement of up to 29.97% in performance compared to the first generation, while also reducing generation time by up to 86.9% compared to manual prompt-based methods, showcasing the effectiveness of this multi-agent collaboration in enhancing the quality and efficiency of chart generation.

AMACE: Automatic Multi-Agent Chart Evolution for Iteratively Tailored Chart Generation

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance. Code, dataset, and checkpoints will be available on GitHub.

VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Topic evolution and stance dynamics are deeply intertwined in online social media, shaping the fragmentation and polarization of public discourse. Yet existing dynamic topic models and stance analysis approaches usually consider these processes in isolation, relying on abstractions that lack interpretability and agent-level behavioral fidelity. We present the stance and topic evolution reasoning framework (SPARK), the first LLM-based multi-agent simulation framework for jointly modeling the co-evolution of topics and stances through natural language interactions. In SPARK, each agent is instantiated as an LLM persona with unique demographic and psychological traits, equipped with memory and reflective reasoning. Agents engage in daily conversations, adapt their stances, and organically introduce emergent subtopics, enabling interpretable, fine-grained simulation of discourse dynamics at scale. Experiments across five real-world domains show that SPARK captures key empirical patterns, such as rapid topic innovation in technology, domain-specific stance polarization, and the influence of personality on stance shifts and topic emergence. Our framework quantitatively reveals the bidirectional mechanisms by which stance shifts and topic evolution reinforce each other, a phenomenon rarely addressed in prior work. SPARK provides actionable insights and a scalable tool for understanding and mitigating polarization in online discourse. Code and simulation resources will be released after acceptance.

SPARK: Simulating the Co-evolution of Stance and Topic Dynamics in Online Discourse with LLM-based Agents

Learning frameworks that use large language models (LLMs) for math problem generation (MPG) mostly perform homogeneous training across epochs on small-scale, manually annotated data. This pattern struggles to provide large-scale, high-quality new data to support continual improvement, and fails to stimulate mutual reinforcement between the problem generation and reasoning abilities, resulting in a lack of reliable solving processes. This paper proposes a synthetic-data-based continual learning framework to improve LLMs' abilities in MPG and math reasoning. The framework cycles through three stages, “supervised fine-tuning, data synthesis, direct preference optimization”, to improve performance continuously and steadily. We propose a data synthesis method with a dual mechanism of model self-play and multi-agent cooperation, which ensures the consistency and validity of synthetic data through sample filtering and rewriting strategies, and overcomes the dependence of continual learning on manually annotated data. A data replay strategy that assesses sample importance via loss differentials is designed to mitigate catastrophic forgetting. Experimental analysis on numerous authoritative math datasets demonstrates the superiority and effectiveness of our framework.

Empowering Math Problem Generation and Reasoning for Large Language Model via Synthetic Data based Continual Learning Framework

Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls. Experimental results show that our method reduces tool-use errors, achieving 3–12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.

Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.

Code Execution as Grounded Supervision for LLM Reasoning

Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents **3MDBench** (**M**edical **M**ultimodal **M**ulti-agent **D**ialogue **Bench**mark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at [https://github.com/univanxx/3mdbench](https://github.com/univanxx/3mdbench).

3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

While Direct Preference Optimization (DPO) eliminates complex reward modeling in aligning large language models (LLMs) with human preferences, its online variant faces significant efficiency bottlenecks due to costly real-time preference sampling and reward model annotation. We propose a novel framework that bridges offline-to-online alignment by systematically transforming static datasets into dynamically adaptive equivalents, without the need for an explicit reward model. Our approach employs paraphrasing techniques to preserve response correctness while aligning data distributions with model-generated outputs, circumventing the need for resource-intensive online interactions. Experiments on mathematical reasoning and conversational tasks demonstrate that our method matches or exceeds the performance of fully online DPO. This work establishes a computationally sustainable paradigm for LLM alignment, particularly benefiting scenarios requiring iterative preference updates and domain adaptation.
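For orientation, the standard DPO objective as it is usually written in the literature is reproduced below. This is the generic offline loss, not the offline-to-online variant proposed in this abstract, whose exact formulation is not given here. The symbols are the conventional ones and are assumptions relative to this page: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a scaling hyperparameter, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses drawn from a preference dataset $\mathcal{D}$.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The loss depends only on log-probability ratios of the two responses, which is why no explicit reward model is required; the online variant differs in where the response pairs come from.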
