EMNLP 2025

November 05, 2025

Suzhou, China


The lack of high-quality test collections challenges Information Retrieval (IR) in specialized domains. This work addresses that issue by comparing supervised classifiers against zero-shot Large Language Models (LLMs) for automated relevance annotation in the oil and gas industry, using human expert judgments as a benchmark. A supervised classifier, trained on limited expert data, outperforms the LLMs, achieving an F1-score that surpasses even a second human annotator. The study also empirically confirms that LLMs are prone to unfairly preferring technologically similar retrieval systems. While LLMs lack precision in this context, a well-engineered classifier offers an accurate and practical path to scaling evaluation datasets within a human-in-the-loop framework that empowers, rather than replaces, human expertise.
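The abstract's headline comparison hinges on scoring each annotation source's relevance labels against the human expert benchmark with the F1-score. A minimal sketch of that evaluation is below; the label values and the `f1_score` helper are illustrative inventions, not the paper's actual data or code.

```python
# Hypothetical sketch of the evaluation described in the abstract:
# agreement between automated relevance labels and expert judgments,
# measured with the binary F1-score. All labels below are invented.

def f1_score(gold, pred, positive=1):
    """Binary F1 of predicted relevance labels against expert gold labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented example: 1 = relevant, 0 = not relevant, per query-document pair.
expert     = [1, 0, 1, 1, 0, 1, 0, 0]   # human expert benchmark
classifier = [1, 0, 1, 1, 0, 0, 0, 0]   # supervised classifier labels
llm        = [1, 1, 1, 0, 0, 1, 1, 0]   # zero-shot LLM labels

print(f"classifier F1: {f1_score(expert, classifier):.2f}")  # 0.86
print(f"LLM F1:        {f1_score(expert, llm):.2f}")         # 0.67
```

In practice a library implementation (e.g. scikit-learn's `f1_score`) would be used; the point here is only that each annotation source is reduced to a single agreement score against the expert gold standard.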


