China

In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization for knowledge editing, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference-optimization yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. 
Additionally, we release test sets from our post-edited data and terminology dictionary.

EMNLP 2025

Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

preference optimization

terminology

translation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Retrieval-Augmented Generation (RAG) is a core approach for enhancing Large Language Models (LLMs), where the effectiveness of the retriever largely determines the overall response quality of RAG systems. Retrievers encompass a multitude of hyperparameters that significantly impact performance outcomes and demonstrate sensitivity to specific applications. Nevertheless, hyperparameter optimization entails prohibitively high computational expenses. Existing evaluation methods suffer from either prohibitive costs or disconnection from domain-specific scenarios. This paper proposes SEARA (Subset sampling Evaluation for Automatic Retriever Assessment), which addresses evaluation data challenges through subset sampling techniques and achieves robust automated retriever evaluation by minimal retrieval facts extraction and comprehensive retrieval metrics. Based on real user queries, this method enables fully automated retriever evaluation at low cost, thereby obtaining optimal retriever for specific business scenarios. We validate our method across classic RAG applications in rednote, including knowledge-based Q&A system and retrieval-based travel assistant, successfully obtaining scenario-specific optimal retrievers.

SEARA: An Automated Approach for Obtaining Optimal Retrievers

Task-Oriented Dialogue (TOD) systems have become increasingly important for real-world applications, yet existing frameworks face significant challenges in handling unstructured information, providing multilingual support, and engaging proactively. We propose SMART (Scalable Multilingual Approach for a Robust TOD System), a novel TOD framework that effectively addresses these limitations. SMART combines traditional pipeline elements with modern agent-based approaches, featuring a simplified dialogue state, intelligent clarification mechanisms, and a unified natural language generation component that eliminates response redundancy. Through comprehensive evaluation on troubleshooting and medical domains, we demonstrate that SMART outperforms baseline systems across key metrics. The system's modular approach enables efficient scaling to new languages, as demonstrated through Spanish and Arabic languages. Integration of SMART in an e-commerce store resulted in reduction in product return rates, highlighting its industry impact. Our results establish SMART as an effective approach for building robust, scalable TOD systems that meet real-world requirements.

SMART: Scalable Multilingual Approach for a Robust TOD System

This paper presents lessons learned from implementing Machine Translation (MT) systems in the context of a global medical technology company. We describe system challenges, legal and security considerations, and the critical role of human-in-the-loop validation for quality assurance and responsible deployment. Furthermore, based on an experiment involving over 11,000 ranked translations, we report reviewer preferences for outputs from small and large language models under various prompting configurations, using a domain-specific dataset spanning five language pairs.

Experience report: Implementing Machine Translation in a Regulated Industry

Enterprises, public organizations, and localization providers increasingly rely on Document-level Machine Translation (DocMT) to process contracts, reports, manuals, and multimedia transcripts across languages. However, existing MT systems often struggle to handle discourse-level phenomena such as pronoun resolution, lexical cohesion, and ellipsis, resulting in inconsistent or incoherent translations. We propose **GRAFT**, a modular graph-based DocMT framework that leverages Large Language Model (LLM) agents to segment documents into discourse units, infer inter-discourse dependencies, extract structured memory, and generate context-aware translations. GRAFT transforms documents into directed acyclic graphs (DAGs) to explicitly model translation flow and discourse structure. Experiments across eight language directions and six domains show GRAFT outperforms commercial systems (e.g., textttGoogle Translate) and closed LLMs (e.g., textttGPT-4) by an average of textbf2.8 d-BLEU, and improves terminology consistency and discourse handling. GRAFT supports deployment with open-source LLMs (e.g., LLaMA, Qwen), making it cost-effective and privacy-preserving. These results position GRAFT as a robust solution for scalable, document-level translation in real-world applications.

GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Human fact-checking is too slow to meet current demands, making automatic fact-checking system an essential alternative. Evaluating such systems is challenging as existing benchmark datasets either suffer from leakage or evidence incompleteness. This limits the realism of current evaluations. We present textbfPoliti-Fact-Only (PFO), a 5-class benchmark dataset of 2,982 political claims from politifact.com, where all post-claim analysis and annotator cues have been removed manually from evidence article. After filtration, evidence contains information available prior to the claim’s verification. By evaluating PFO, we see an average performance drop of textbf11.39\% in terms of macro-f1 compared to PFO's unfiltered version. Based on the identified challenges of the existing LLM-based fact-checking system, we propose textbfRAV (Recon-Answer-Verify), an agentic framework with three agents, it iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. Unlike prior literature, we worked on reducing the follow-up question complexity by leveraging two 2 types of structured questions, which either validate a fact or inquire about a fact. RAV generalizes across both domains and label granularities, outperforming state-of-the-art methods by textbf57.5\% on PFO textit(political, 5-class) and by textbf3.05\% on the widely used HOVER dataset textit(encyclopedic, 2-class).

Recon, Answer, Verify: Agents in Search of Truth

Large language models (LLMs) have achieved overwhelming success but require massive storage and computational resources to support the generative inference. Post-training quantization (PTQ) is a promising approach to reduce memory usage, latency and energy consumption of the deployment of LLMs. However, the presence of outliers makes most existing PTQ methods dedicated to dynamic quantization, which turns out hardware-unfriendly and often leads to large quantization errors in static scenarios. To address the above limitations, we introduce a Static Hierarchical Mix-precision Quantization method (SHMQ), which enables near-lossless and hardware-friendly compression of LLMs. Theoretically, our proposed SHMQ quantifies both inter-layer and intra-layer sensitivity through unified derivations involving Hessian. Specifically, SHMQ conducts a systematic precision allocation strategy, which seamlessly integrates coarse-grained inter-layer and fine-grained intra-layer static mix-precision quantization. Furthermore, the permutation procedure, which reorders sensitive channels and insensitive channels that share similar distribution, is leveraged to mitigate static quantization error. Our proposed SHMQ achieves 75.58\% on zero-shot reasoning tasks in W4.8A8 Qwen2.5-7B-Instruct, narrowing the accuracy gap to merely 0.13\% while yielding averaged 2.86times practical speedup.

Beyond Dynamic Quantization: An Efficient Static Hierarchical Mix-precision Framework for Near-Lossless LLM Compression

We assess whether AI systems can credibly evaluate investment risk appetite—a task that must be thoroughly validated before automation. Our analysis was conducted on proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user profiles that reflect real users with varying attributes such as country and gender. As a result, the models exhibit significant variance in score distributions when user attributes—such as country or gender—that should not influence risk computation are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the low- and mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.

Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk Appetite?

Training large language models at scale suffers from costly instabilities. We introduce the R-Metric, a proactive reliability metric that predicts failures before they occur by combining hardware monitoring, training dynamics, and model performance. Achieving 0.973-1.00 F1-Score with 12-minute lead time, our lightweight approach (1.8% overhead) democratizes enterprise-grade reliability monitoring for resource-constrained organizations.

A Proactive Reliability Metric for Detecting Failures in Language Model Training

Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

People increasingly seek healthcare information from Large Language Models (LLMs), yet the nature of these conversational interactions and their inherent risks remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-14K, a curated dataset of 14K real-world conversations composed of 62K user messages. We use HealthChat-14K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study users' conversational trajectories, interaction patterns, emotional behaviors, and sycophancy-inducing interactions. Our analysis reveals insights into how users seek healthcare information, including the nature of health information users seek, their typical conversational trajectories, expressions of affect, and specific interaction patterns related to conversational challenges and leading questions, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. We will release our analyzed conversations and corresponding analysis artifacts in a curated dataset to foster future research.

Downloads

Next from EMNLP 2025

SEARA: An Automated Approach for Obtaining Optimal Retrievers

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

SEARA: An Automated Approach for Obtaining Optimal Retrievers

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads