China

In many industrial settings, users wish to ask questions in natural language, the answers to which require assembling information from diverse structured data sources. With the advent of Large Language Models (LLMs), applications can now translate natural language questions into a set of API calls or database calls, execute them, and combine the results into an appropriate natural language response.
However, these applications remain impractical in realistic industrial settings because they do not cope with the data source heterogeneity that typifies such environments. In this paper, we simulate the heterogeneity of real industry settings by introducing two extensions of the popular Spider benchmark dataset that require a combination of database and API calls. Then, we introduce and evaluate a declarative approach to handling such data heterogeneity.
We demonstrate that our declarative approach does a significantly better job of coping with data source heterogeneity than state-of-the-art LLM-based agentic or imperative code generation systems. Our augmented benchmarks will soon be available to the research community.

EMNLP 2025

Declarative Techniques for NL Queries over Heterogeneous Data

api sequencing

data heterogeneity

text2sql

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach an NDCG@10 of only around 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://anonymous.4open.science/r/CDR-Benchmark-B53A.

Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
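The NDCG@10 figure quoted for the CDR benchmark can be made concrete with a small sketch. This is a generic implementation of the metric, not the benchmark's own evaluation code; the function names and the use of linear gain (rather than the exponential `2^rel - 1` variant) are assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: labels of the retrieved items, in ranked order.
    all_relevances: labels of every judged item for the query, used to
        build the ideal ranking (assumption: linear gain).
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items exist, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A system that places the only relevant conversation at rank 2 instead of rank 1 scores `1 / log2(3)`, roughly 0.63, which shows how quickly the metric penalizes misplaced results.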

The specialized vocabulary and nuanced concepts of the telecommunications industry pose persistent challenges for standard Natural Language Processing (NLP) models. Generic embedding models often struggle to represent telecom-specific semantics, limiting their utility in retrieval and downstream tasks. We present T-VEC (Telecom Vectorization Model), a domain-adapted embedding model fine-tuned from the gte-Qwen2-1.5B-instruct backbone using a triplet loss objective over 100K curated telecom triplets. T-VEC sets a new benchmark in telecom retrieval, achieving CosineSim@1 of 0.8814, Recall@5 of 0.9249, and Top1 Exact Match of 0.9310, significantly outperforming leading general-purpose models like MPNet, BGE, and E5 by a 20-30% relative margin. These gains confirm T-VEC’s superior domain grounding and retrieval precision, with embedding visualizations further showcasing tight clustering of telecom-relevant concepts. We release T-VEC and its tokenizer to support more robust and semantically faithful NLP applications within the telecom domain.

T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning
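The triplet loss objective named in the T-VEC abstract can be sketched in a few lines. The abstract does not state the distance function or the margin, so cosine distance and `margin=0.2` below are assumptions chosen for illustration, as are the function names.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed value, not from the paper).
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In training, this per-triplet value would be averaged over a batch and backpropagated through the embedding model; the sketch only shows the quantity being minimized.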

Cascade systems for open-ended text generation face a fundamental challenge: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose _semantic agreement_, meaning-level consensus between ensemble outputs, as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated on models ranging from 500M to 70B parameters, semantic cascades improve deferral accuracy, match or surpass target-model quality at 40% of the cost, and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.

Semantic Agreement Enables Efficient Open-Ended LLM Cascades
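The deferral rule described in the semantic-cascade abstract can be illustrated with a toy sketch. Everything here is hypothetical: the function names, the `threshold` value, and especially the similarity measure. Token-level Jaccard overlap is used only as a runnable stand-in; it is not a meaning-level measure, and the abstract does not say how the authors score agreement.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Stand-in similarity (assumption): overlap of lowercase token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def agreement_score(outputs, similarity=jaccard_similarity):
    """Mean pairwise similarity across the ensemble's outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output trivially agrees with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def cascade(prompt, small_models, large_model, threshold=0.6,
            similarity=jaccard_similarity):
    """Return (answer, deferred). Defer to the large model on low agreement."""
    outputs = [model(prompt) for model in small_models]
    if agreement_score(outputs, similarity) >= threshold:
        return outputs[0], False
    return large_model(prompt), True
```

Because the models are plain callables that map a prompt to a string, the same shape works with black-box APIs; swapping `similarity` for an embedding-based or entailment-based scorer would bring the sketch closer to a meaning-level signal.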

Large Language Models (LLMs) have transformed natural language processing, yet they still struggle with direct text editing tasks that demand precise, context-aware modifications. While models like ChatGPT excel in text generation and analysis, their editing abilities often fall short, addressing only superficial issues rather than deeper structural or logical inconsistencies. In this work, we introduce a dual approach to enhance LLMs' editing performance. First, we present InstrEditBench, a high-quality benchmark dataset comprising over 20,000 structured editing tasks spanning Wiki articles, LaTeX documents, code, and database domain-specific languages (DSLs). InstrEditBench is generated using an innovative automated workflow that accurately identifies and evaluates targeted edits, ensuring that modifications adhere strictly to specified instructions without altering unrelated content. Second, we propose FineEdit, a specialized model trained on this curated benchmark. Experimental results demonstrate that FineEdit achieves significant improvements of around 10% over Gemini on direct editing tasks, convincingly validating its effectiveness.

Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.

Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experimental results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.

Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (REFINE, RERANK, STOP) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning.

From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
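The transition-based loop that the SMR abstract describes (discrete REFINE, RERANK, and STOP actions with early stopping) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the state layout, the `policy`, `refine`, and `rerank` callables, the `max_steps` cap, and the revisited-state check are all hypothetical, and any re-retrieval after a refined query is omitted.

```python
from enum import Enum


class Action(Enum):
    REFINE = "refine"  # rewrite the query
    RERANK = "rerank"  # reorder the current documents
    STOP = "stop"      # end reasoning early


def run_state_machine(query, docs, policy, refine, rerank, max_steps=8):
    """Apply discrete actions until STOP, a revisited state, or max_steps.

    policy(state) -> Action, refine(query) -> str, and
    rerank(query, docs) -> list are caller-supplied (hypothetical) callables.
    """
    state = (query, tuple(docs))
    seen = {state}
    for _ in range(max_steps):
        action = policy(state)
        if action is Action.STOP:
            break
        if action is Action.REFINE:
            state = (refine(state[0]), state[1])
        else:  # Action.RERANK
            state = (state[0], tuple(rerank(state[0], state[1])))
        if state in seen:
            break  # redundant trajectory: this state was already visited
        seen.add(state)
    return state
```

Treating each step as a transition between explicit states is what makes the redundancy check cheap: a repeated `(query, docs)` pair ends the loop instead of producing further reasoning tokens.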

Supervised fine-tuning (SFT) is a widely used and highly effective method for adapting Large Language Models (LLMs) to specific tasks. However, it often suffers from overfitting, causing models to excel on fine-tuned data but struggle with unseen or rare real-world inputs. While recent methods like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) aim to align LLMs with human values and tasks, they face challenges such as the high cost of human labeling or instabilities and biases inherent in using LLMs as judges. To address these issues, we propose a novel approach called Reinforcement Learning from supervised Alignment (RLA), which constructs a supervised alignment to train the reward model for reinforcement learning. Using only 100,000 MS MARCO samples, our method outperforms RLAIF by a relative margin ranging from +5.38% to +131.8%. It also significantly enhances the baseline Llama3 LLM, achieving up to +55% improvement on in-domain tasks and up to +16% on out-of-domain tasks. While RLA slightly underperforms supervised fine-tuning (SFT) on in-domain benchmarks, it surpasses SFT by up to 50 times on out-of-domain and cross-task evaluations, demonstrating strong generalization capabilities.

Reinforcement Learning with Supervised Alignment

Scientific evaluation of Large Language Models is an important topic, as it quantifies the progress we make with new models. Even though current LLMs show a high level of accuracy on benchmark datasets, the single-sample approach to evaluating them is not sufficient, as it ignores the high entropy of LLM responses. We introduce a Monte-Carlo evaluation framework for evaluating LLMs that follows behavioral science methodologies and provides statistical guarantees for estimates of performance. We test our framework on multiple LLMs to see if they are susceptible to cognitive biases. We find a significant effect of prompts that induce cognitive biases in LLMs, raising questions about their reliability in social sciences and business. We also see higher susceptibility of newer and larger LLMs to cognitive biases, which shows a development towards more human-like and less rational LLM responses. We conclude by calling for the use of Monte-Carlo sampling, as opposed to pass@1, for broader LLM evaluation.

A Monte-Carlo Sampling Framework For Reliable Evaluation of Large Language Models Using Behavioral Analysis
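The contrast between single-sample (pass@1) evaluation and Monte-Carlo sampling can be illustrated with a short sketch. It is not the paper's framework: the function names, the stub model, and the normal-approximation (Wald) confidence interval are assumptions; the abstract does not say which statistical guarantee the authors use.

```python
import math
import random


def monte_carlo_accuracy(sample_response, is_correct, prompt,
                         n_samples=100, z=1.96):
    """Estimate accuracy on one prompt from repeated stochastic samples.

    Returns (estimate, (lower, upper)) using a normal-approximation
    interval (an assumed choice; z=1.96 corresponds to about 95%).
    """
    hits = sum(1 for _ in range(n_samples) if is_correct(sample_response(prompt)))
    p = hits / n_samples
    half_width = z * math.sqrt(p * (1 - p) / n_samples)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))


def make_noisy_model(p_correct, seed=0):
    """Hypothetical stand-in for an LLM: answers 'yes' with prob. p_correct."""
    rng = random.Random(seed)
    return lambda prompt: "yes" if rng.random() < p_correct else "no"
```

A single draw from `make_noisy_model(0.7)` would report either 0% or 100% accuracy on the prompt, whereas a few hundred draws recover an estimate near 0.7 with an interval that narrows as `n_samples` grows.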

Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce **BehaviorBench**, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose **BehaviorSFT**, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) than standard fine-tuning or explicitly instructed agents.
