Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia? In this paper, we focus on inconsistencies, a specific type of factual inaccuracy. We introduce the task of corpus-level inconsistency detection and present WikiCollide, a human-annotated dataset for this task. We also propose CLAIRE, an agent-based system that combines an LLM with information retrieval to identify inconsistencies effectively, outperforming strong LLM baselines by 2.1% AUROC on our dataset. Based on our findings, we estimate that at least 79.9 million facts (approximately 3.3%) in the English Wikipedia contradict at least one other fact within the corpus (99% confidence interval: 37.6 million to 121.9 million). We further show that these inconsistencies propagate into widely used NLP datasets, affecting gold labels in at least 7.3% of examples in the fact-verification dataset FEVEROUS and 4.0% in the question-answering dataset AmbigQA. In a user study with experienced Wikipedia editors, 87.5% of participants reported increased confidence in identifying inconsistencies when using CLAIRE, and they discovered on average 64.7% more inconsistencies in the same amount of time. Our results demonstrate that LLM-based tools can effectively assist humans in detecting inconsistencies in large-scale corpora.
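The corpus-level estimate above (a point estimate with a 99% confidence interval, scaled to all of English Wikipedia) can be illustrated with a small sketch. The sample counts, the corpus size, and the use of a normal-approximation (Wald) interval here are all assumptions for illustration; the abstract does not state the paper's actual sample size or interval method.

```python
import math

def proportion_ci(k, n, z=2.576):
    """Normal-approximation (Wald) confidence interval for a proportion.

    k: number of inconsistent facts found in the sample
    n: number of sampled facts
    z: critical value; 2.576 corresponds to a 99% confidence level
    """
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def extrapolate(k, n, corpus_size, z=2.576):
    """Scale the sampled inconsistency rate and its CI to the full corpus."""
    lo_p, hi_p = proportion_ci(k, n, z)
    p = k / n
    return p * corpus_size, lo_p * corpus_size, hi_p * corpus_size

# Hypothetical numbers: 33 inconsistent facts in a sample of 1,000,
# extrapolated to a hypothetical corpus of 2.4 billion facts.
est, lo, hi = extrapolate(33, 1000, 2_400_000_000)
```

With a real annotated sample, a tighter interval method (e.g. Wilson or bootstrap) would typically be preferred over the Wald approximation, especially for small inconsistency rates.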