China

Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems.
We present IPR---a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels.
IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter τ in [0,1] that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. 
To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates.
Deployed on a major cloud platform, IPR achieves 43.9\% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency.

EMNLP 2025

IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Interpreting visual scenes with high-level reasoning is essential for many real-world applications—from autonomous systems to content moderation—but training and maintaining Vision-Language Models (VLMs) remains resource-intensive and opaque. In this work, we present CAPSTONE, a lightweight and modular framework designed for industrial settings. Instead of relying on multimodal training or fine-tuning large models, CAPSTONE transforms outputs from off-the-shelf vision models into structured text prompts that can be interpreted by a frozen Large Language Model (LLM). This plug-and-play architecture enables reasoning over visual input without access to raw pixels, dramatically reducing computational cost and complexity. On the POPE dataset, our system—using a 7B LLM—outperforms several fully trained VLMs in zero-shot evaluations, demonstrating strong generalization without retraining. CAPSTONE offers a scalable and interpretable alternative for companies looking to integrate visual reasoning capabilities without the burden of full-scale VLM pipelines.

CAPSTONE: Composable Attribute‑Prompted Scene Translation for Zero‑Shot Vision–Language Reasoning

In the domain of text-to-image generative models, biases inherent in training datasets often propagate into generated content, posing significant ethical challenges, particularly in socially sensitive contexts. We introduce FairCoT, a novel framework that enhances fairness in text-to-image models through Chain-of-Thought (CoT) reasoning within multimodal generative large language models. FairCoT employs iterative CoT refinement to systematically mitigate biases, and dynamically adjusts textual prompts in real time, ensuring diverse and equitable representation in generated images. By integrating iterative reasoning processes, FairCoT addresses the limitations of zero-shot CoT in sensitive scenarios, balancing creativity with ethical responsibility. Experimental evaluations across popular text-to-image systems—including DALL-E and various Stable Diffusion variants—demonstrate that FairCoT significantly enhances fairness and diversity without sacrificing image quality or semantic fidelity. By combining robust reasoning, lightweight deployment, and extensibility to multiple models, FairCoT represents a promising step toward more socially responsible and transparent AI-driven content generation.

FairCoT: Enhancing Fairness in Text-to-Image Generation via Chain of Thought Reasoning with Multimodal Large Language Models

Large Language Models (LLMs) often exhibit social biases inherited from their training data. While existing benchmarks evaluate bias by term-based mode through direct term associations between demographic terms and bias terms, LLMs have become increasingly adept at avoiding biased responses, leading to seemingly low levels of bias. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios rather than superficial terms. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response at the term level, they continue to reinforce biases in nuanced settings. Data, code, and results are available at \url{https://github.com/JP-25/Description-based-Bias-Benchmark}.

What's Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs

We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples, where natural language questions are translated to executable SQL queries. This representation is derived semi-automatically with minimal human intervention. We demonstrate its utility in evaluation by closely analyzing (i) the composition of widely-used benchmarks and (ii) model performance at a granular level beyond overall accuracy scores. Our analyses not only reveal example subsets that challenge all models, including those with strongest overall performance, but more importantly, specific example classes where smaller, cheaper models perform comparably to frontier models. Finally, we show a practical application of SQLSpace at inference time, using our representation to predict which natural language questions will likely yield incorrect SQL from a text-to-SQL model, and rewriting such questions to improve accuracy.

SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps

Effective disaster management requires timely access to accurate and contextually relevant information. Existing Information Retrieval (IR) benchmarks, however, focus primarily on general or specialized domains, such as medicine or finance, neglecting the unique linguistic complexity and diverse information needs encountered in disaster management scenarios. To bridge this gap, we introduce DisastIR, the first comprehensive IR evaluation benchmark specifically tailored for disaster management. DisastIR comprises 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs, covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories that include 301 specific event types. Our evaluations of 30 state-of-the-art retrieval models demonstrate significant performance variances across tasks, with no single model excelling universally. Furthermore, comparative analyses reveal significant performance gaps between general-domain and disaster management-specific tasks, highlighting the necessity of disaster management-specific benchmarks for guiding IR model selection to support effective decision-making in disaster management scenarios. All source codes and DisastIR are available at https://anonymous.4open.science/r/Disaster_IR/.

DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management

Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit \textit{sycophancy}—conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce \textbf{SYCON Bench} (\textbf{SY}cophantic \textbf{CON}formity benchmark), a novel evaluation suite that assesses sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (\textit{Turn of Flip}) and how frequently it shifts its stance under sustained user pressure (\textit{Number of Flip}). Applying \textsc{SYCON Bench} to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario.

Measuring Sycophancy of Language Models in Multi-turn Dialogues

Chat-oriented dialogue systems that deliver tangible benefits, such as sharing news or frailty prevention for seniors, require Proactive acquisition of specific user Information Via chats On user-favored Topics (PIVOT). This study proposes the PIVOT task to support the development of these systems. In this task, a system needs to acquire a user's answers to predefined questions without making the user feel abrupt while engaging in a chat on a predefined topic. We created and analyzed a dataset of 650 PIVOT chats, identifying key challenges and effective strategies for recent LLMs. Our system, designed from these insights, surpassed the performance of LLMs prompted solely with task instructions. Finally, we demonstrate that automatic evaluation of this task is reasonably accurate, suggesting its potential as a framework to efficiently develop techniques for systems dealing with complex dialogue goals, extending beyond the scope of PIVOT alone.

Proactive User Information Acquisition via Chats on User-Favored Topics

Large Language Models (LLMs) such as GPT and LLaMA excel in natural language tasks, e.g., text generation and machine translation. However, inherent biases from training on vast Internet datasets potentially amplify harmful stereotypes—widely held, oversimplified, and often inaccurate generalizations about groups of people. Our contribution introduces a novel, systematic, and architecture-aware method to identify and mitigate stereotypical bias in decoder-only transformer models. This interpretable approach operates without gradient access or retraining from scratch. We first evaluate bias and then apply a bias localization mechanism that correlates internal activations with a newly defined Context Influence (CI) Score. Our method pinpoints specific attention heads that consistently align with biased shifts in model predictions. To mitigate this, we introduce a soft pruning strategy that scales attention head parameters based on their correlation strength, followed by lightweight fine-tuning to maintain fluent text generation. Experiments across five models demonstrate our approach reduces bias by up to 37% on BBQ, 32% on StereoSet, and 33% on CrowS-Pairs while simultaneously improving reasoning performance on MMLU by up to 10%. \footnote{Data and code are avaliable here \url{https://anonymous.4open.science/status/mitigate_bias}}

Toward Inclusive Language Models: Sparsity-Driven Calibration for Systematic and Interpretable Mitigation of Social Biases in LLMs

Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response y⁺ and dispreferred response y⁻ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-N² pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.

DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.

Downloads

Next from EMNLP 2025

CAPSTONE: Composable Attribute‑Prompted Scene Translation for Zero‑Shot Vision–Language Reasoning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

CAPSTONE: Composable Attribute‑Prompted Scene Translation for Zero‑Shot Vision–Language Reasoning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads