China

Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming it a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning, semantic understanding, etc., based on over 8,500 various LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement.

EMNLP 2025

RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs

model-level scaling up

routing llms

evaluation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in the issue of textbffactual hallucinations, which can be difficult to detect and mislead users without relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs in different capabilities. In this paper, we propose to address it by directly augmenting LLM's fundamental ability to precisely leverage its knowledge and introduce textbfPKUE, which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct textbfFactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves LLM overall performance, with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in a different language.

Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization

Medical question answering fundamentally relies on accurate clinical knowledge. The dominant paradigm, Retrieval-Augmented Generation (RAG), acquires expertise \textit{conceptual} knowledge from large-scale medical corpus to guide general-purpose large language models (LLMs) in generating trustworthy answers. However, existing retrieval approaches often overlook the patient-specific \textit{factual knowledge} embedded in Electronic Health Records (EHRs), which limits the contextual relevance of retrieved \textit{conceptual knowledge} and hinders its effectiveness in vital clinical decision-making. This paper introduces RGAR, a recurrence generation-augmented retrieval framework that synergistically retrieves both \textit{factual} and \textit{conceptual} knowledge from dual sources (i.e., EHRs and the corpus), allowing mutual refinement through iterative interaction. Across three factual-aware medical QA benchmarks, RGAR establishes new state-of-the-art performance among medical RAG systems. Notably, RGAR enables the Llama-3.1-8B-Instruct model to surpass the considerably larger GPT-3.5 augmented with traditional RAG. Our findings demonstrate the benefit of explicitly mining patient-specific factual knowledge during retrieval, consistently improving generation quality and clinical relevance.

RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization (PTQ) techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler (KL) divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. Our work provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Experimental results demonstrate that FlexQuant achieves a 1.3× end-to-end speedup across diverse language tasks with negligible accuracy loss introduced. This framework offers a flexible and adaptive solution for efficient LLM deployment.

FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints i.e. specific actions need to be performed within a particular time interval following the preceding steps. Overly aggressive local parallelization may disrupt this constraint, potentially compromising the entire cooking process. This strict time constraint between actions raises a unique challenge for agents to balance between maximizing concurrent operations and adhering to critical timing constraints. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We will open-source our benchmark and code.

Recipe2Plan: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.

GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the \textit{“lost in the middle”} issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose \textit{Tree of Agents (TOA)}, a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at \url{https://anonymous.4open.science/r/Tree-of-Agents-4C9B/

Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

In this paper, we introduce a new \emph{process prejudge} strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step, with at least one step that follows the prejudge node that has no paths toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs~\footnote{Code and data is released at \url{https://github.com/xxx}.}.

Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning

Large language models (LLMs) offer a novel and convenient avenue for humans to acquire knowledge. However, LLMs are prone to providing "midguy" answers regardless of users' knowledge background, thereby failing to meet each user's personalized needs. To tackle the problem, we propose to generate personalized answers with LLMs based on users' past question-answering records. We dynamically generate and update a user's domain and global profiles as the user asks questions, and use the latest profile as the context to generate the answer for a newly-asked question. To save tokens, we propose to compress the domain profile into a set of keywords and use the keywords to prompt LLMs. We theoretically analyze the effectiveness of the compression strategy. Experimental results show that our method can generate more personalized answers than comparative methods. The code and dataset are available at https://anonymous.4open.science/r/PRF-QA-169B.

Personalized Question Answering with User Profile Generation and Compression

Typically, parametric adaptation methods such as domain-adaptive pretraining (DAP) and retrieval-augmented generation (RAG) have been considered effective approaches for adapting large language models (LLMs) to new knowledge or domains. To unify positive effects of parametric adaptation and RAG, this paper proposes GenPoE, i.e., "generative" passage-level mixture of experts (MoEs) for enhancing knowledge of LLMs. The key component is its novel MoE-generating hypernetwork which takes in-context retrieved passages and generates their "expert’’ parameters, where these generated parameters are then integrated into LLMs by forming expert networks. With its use of "generated" parameters, GenPoE does not require a separate parameter training or finetuning stage, which is often costly. By parameterizing passages into expert networks, GenPoE likely exhibits robustness even when the retrieved passages are irrelevant. Experiment results in two open-domain question answering (QA) tasks present that GenPoE shows improved performances over other passage-level knowledge editing, and its combination of RAG produces superior performances over RAG. Our data and code will be available at \url{https://github.com/XXX/XXX}.

GenPoE: Generative Passage-level Mixture of Experts for Knowledge Enhancement of LLMs

Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a "think before answering" strategy. This method enhances the model's open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model's performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

Downloads

Next from EMNLP 2025

Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES