China

Human evaluation is the gold standard for evaluating text generation models. However, it is expensive. In order to fit budgetary constraints, a random subset of the test data is often chosen in practice for human evaluation. However, randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop and analyze a suite of selectors to get the most informative datapoints for human evaluation, taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that only ~70% of the test data is needed to produce the same evaluation result as the entire data.

EMNLP 2025

How to Select Datapoints for Efficient Human Evaluation of NLG Models?

subset selection

human evaluation

evaluation

machine translation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with n tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.

Hierarchical Bracketing Encodings Work for Dependency Graphs

Pre-hospital Emergency Care (PEC) systems are critical for managing life-threatening emergencies where rapid intervention can significantly impact patient outcomes. The rising global demand for PEC services, coupled with increased emergency calls and strained emergency departments, necessitates efficient resource utilization through Telephone Triage (TT) systems. However, existing TT processes face challenges such as incomplete data collection, communication barriers, and manual errors, leading to high over-triage and under-triage rates. This study proposes InTriage, an AI-driven multilingual TT system to provide decision support for triage. InTriage enhances accuracy by transcribing emergency calls, extracting critical patient information, prompting supplementary, and providing real-time triage decisions support. We conducted an evaluation on a real-world corpus of approximately 40 hours of telephone data, achieving a word error rate of 14.57% for speech recognition and an F1 score of 73.34% for key information extraction. By improving communication efficiency and reducing triage errors, InTriage offers a scalable solution to potentially help address the growing demands on PEC systems globally.

InTriage: Intelligent Telephone Triage in Pre-Hospital Emergency Care

Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work is open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.

PDFMathTranslate: Scientific Document Translation Preserving Layouts

Political pledges reflect candidates' policy commitments, but tracking their fulfilment requires reasoning over incremental evidence distributed across multiple, dynamically updated sources. Existing methods simplify this task into a document classification task, overlooking its dynamic temporal and multi-document nature. To address this issue, we introduce \textsc{PledgeTracker}, a system that reformulates pledge verification into structured event timeline construction. PledgeTracker consists of three core components: (1) a multi-step evidence retrieval module; (2) a timeline construction module and; (3) a fulfilment filtering module, allowing the capture of the evolving nature of pledge fulfilment and producing interpretable and structured timelines. We evaluate PledgeTracker in collaboration with professional fact-checkers in real-world workflows, demonstrating its effectiveness in retrieving relevant evidence and reducing human verification effort.

PledgeTracker: A System for Monitoring the Fulfilment of Pledges

We present SciClaims, an interactive web-based system for end-to-end scientific claim analysis in the biomedical domain. Designed for high-stakes use cases such as systematic literature reviews and patent validation, SciClaims extracts claims from text, retrieves relevant evidence from PubMed, and verifies their veracity. The system features a user-friendly interface where users can input scientific text and view extracted claims, predictions, supporting or refuting evidence, and justifications in natural language. Unlike prior approaches, SciClaims seamlessly integrates the entire scientific claim analysis process using a single large language model, without requiring additional fine-tuning. SciClaims is optimized to run efficiently on a single GPU and is publicly available for live interaction at https://labdemos.expertcustomers.ai/health_claims

SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 10 state-of-the-art base VLMs and their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate the utility of VLM-Lens through prob- ing experiments, revealing systematic differences in the hidden representations of VLMs across layers, and we report its performance in terms of time and memory efficiency. VLM-Lens is released as an open-sourced library to accelerate community efforts in understanding and improving VLMs. Github: https://github.com/compling-wat/vlm-lens

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28\% improvement in BLEU scores and a 15\% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

High-quality schematic diagrams, which provide a conceptual overview of the research, play a crucial role in summarizing and clarifying a study’s core ideas. However, creating these diagrams is time-consuming for authors and remains challenging for current AI systems, as it requires both a deep understanding of the paper’s content and a strong sense of visual design. To address this, we introduce SCISKETCH, an open-source framework that supports two automated workflows for schematic diagram generation using foundation models, shown in Figure 1. 1) In the graphic-code-based workflow, SCISKETCH follows a two-stage pipeline: it first produces a layout plan expressed in a graphical code language with a self-refinement and self-verification mechanism. It then integrates empirical images and symbolic icons to create a visually coherent, informative diagram. 2) In the image-based workflow, SCISKETCH directly synthesizes the diagram image through image generation with a self-refinement mechanism. Through both automatic and human evaluations, we show that SCISKETCH outperforms several state-of-the-art foundation models, including GPT-4o, and Gemini-2.5-Pro, in generating schematic diagrams for scientific papers. We make SCISKETCH fully open-sourced, providing researchers with an accessible, extensible tool for high-quality schematic diagram generation in scientific fields.

SciSketch: An Open-source Framework for Automated Schematic Diagram Generation in Scientific Papers

The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student's affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student's learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student's emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student's emotions are captured from the conversational text as well as from their facial expressions. The student's emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally-aware response. We have effectively evaluated our model using automatic evaluation metrics across eight pedagogical dimensions (Maurya et al., 2025) and user studies. We report a massive 23 point performance gain using the win rate (Macina et al., 2025) and a 3 point gain at an overall level using DAMR scores which strongly supports our hypothesis of improving LLM-based tutor's pedagogical abilities by modeling students' emotions.

MathBuddy: A Multimodal System for Affective Math Tutoring

Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions require deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions. Our method is being deployed in production.

Downloads

Next from EMNLP 2025

Hierarchical Bracketing Encodings Work for Dependency Graphs

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES