China

Large Language Models (LLMs) have demonstrated impressive performance across various domains. 
However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. 
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), reducing memory usage but causing performance degradation. 
Additionally, converting fine-tuned models to low-precision representations further degrades performance. 
In this paper, we identify an imbalance in fine-tuning quantized LLMs with LoRA: overly complex adapter inputs and outputs versus low effective trainability of the adapter, leading to underfitting during fine-tuning.
Thus, we propose Quantized LLMs fine-tuning with Balanced Low-Rank Adaptation (Q-BLoRA), which simplifies the adapter’s inputs and outputs while increasing the adapter’s rank to alleviate underfitting during fine-tuning. 
For low-precision deployment, we propose Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation (QA-BLoRA), which aligns with the block-wise quantization and facilitates quantization-aware fine-tuning of low-rank adaptation based on the parameter merging of Q-BLoRA.
Both Q-BLoRA and QA-BLoRA are easily implemented and offer the following optimizations: (i) Q-BLoRA consistently achieves state-of-the-art accuracy compared to baselines and other variants; (ii) QA-BLoRA enables the direct generation of low-precision inference models, which exhibit significant performance improvements over other low-precision models.
We validate the effectiveness of Q-BLoRA and QA-BLoRA across various models and scenarios.
Code will be made available at https://github.com/xiaocaigou/qbaraqahira.
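As a rough illustration of the block-wise quantization that QA-BLoRA is described as aligning with, the sketch below quantizes a flat weight vector in fixed-size blocks with one symmetric absmax scale per block. The block size, the signed 4-bit range, and the function names are assumptions made for illustration. This is a generic scheme, not the paper's implementation.

```python
from typing import List, Tuple


def quantize_blockwise(
    weights: List[float], block_size: int = 4, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize each block independently with a symmetric absmax scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # An all-zero block gets scale 1.0 so the division below is safe.
        scale = absmax / qmax if absmax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the representable range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def dequantize_blockwise(
    codes: List[int], scales: List[float], block_size: int = 4
) -> List[float]:
    """Map integer codes back to floats using the scale of the block they belong to."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]
```

Because each block stores its own scale, any low-rank update merged into the weights has to respect block boundaries if the merged result is to stay representable in the same low-precision format.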

EMNLP 2025

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance in Adaptation

low-precision deployment

large language models fine-tuning

low-rank adaptation

underfitting

quantization-aware training

quantization

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Argument(ation) mining (AM) is the automated identification and extraction of argumentative structures in natural language. The field has advanced rapidly, offering powerful tools to analyze and interpret large, complex discourse in diverse domains (political debates, medical reports, etc.). In this paper, we introduce an AM-boosted version of BCause, a large-scale deliberation platform. The system extracts and analyzes arguments from online discussions in the context of deliberative democracy, with the aim of making structured argumentation easier to understand and access in large-scale deliberation processes.

AM4DSP: Argumentation Mining in Structured Decentralized Discussion Platforms for Deliberative Democracy

Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation.
To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

On-device System of Compositional Multi-tasking in Large Language Models

Predicting the user’s shopping intent is a crucial task in e-commerce. In particular, determining the product category in which the user wants to shop is essential for delivering relevant search results and website navigation options. Existing query classification models are reported to have excellent predictive performance on single-intent queries (e.g. ‘running shoes’), but there is little research on predicting multiple intents for a broad query (e.g. ‘running gear’). Although the training data for broad query classification can be easily obtained, the evaluation of multi-label categorization remains challenging, as the set of true labels for multi-intent queries is subjective and ambiguous.

In this work, we propose an automatic method of creating the evaluation data for multi-label e-commerce query classification. We reduce the ambiguity of the annotations by blending the label assessment from three different sources: user click data, query-item relevance, and LLM judgments.
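The abstract does not say how the three label sources are blended. As one possible reading, the sketch below takes a per-category score in [0, 1] from each source, forms a weighted average, and keeps the categories that clear a threshold. The weights, the threshold, the score scale, and the function name are all assumptions for illustration.

```python
from typing import Dict, List, Sequence


def blend_label_scores(
    sources: Sequence[Dict[str, float]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Return categories whose weighted-average score across sources meets the threshold."""
    if len(sources) != len(weights):
        raise ValueError("need exactly one weight per source")
    total = sum(weights)
    # Any category mentioned by at least one source is a candidate label.
    categories = set().union(*sources)
    kept: List[str] = []
    for category in sorted(categories):
        # A source that does not mention a category contributes a score of 0.
        score = sum(w * src.get(category, 0.0) for src, w in zip(sources, weights)) / total
        if score >= threshold:
            kept.append(category)
    return kept


# Hypothetical scores for the broad query 'running gear'.
clicks = {"shoes": 0.9, "apparel": 0.6, "electronics": 0.1}
relevance = {"shoes": 1.0, "apparel": 0.8, "electronics": 0.2}
llm_judgment = {"shoes": 1.0, "apparel": 1.0, "electronics": 0.0}
labels = blend_label_scores([clicks, relevance, llm_judgment], weights=[1, 1, 1])
```

A weighted average lets a strong signal from one source offset a weak one from another. A stricter rule, such as requiring agreement from at least two sources, would trade recall for precision.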

FABRIC: Fully-Automated Broad Intent Categorization in E-commerce

We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries. Our approach first extracts and consolidates aspect–sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 8 million anonymized customer reviews covering 97,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
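The selection and prompt-construction steps described above can be sketched as follows, assuming the aspect–sentiment pairs have already been extracted. The tuple layout, the "first N reviews per aspect" sampling rule, the prompt wording, and the function name are illustrative assumptions. The abstract does not specify how representative reviews are chosen.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (review_id, aspect, sentiment), assumed to come from an upstream ABSA step.
AspectPair = Tuple[int, str, str]


def build_summary_prompt(
    pairs: List[AspectPair],
    reviews: Dict[int, str],
    top_k: int = 2,
    per_aspect: int = 2,
) -> Tuple[List[str], str]:
    """Pick the most frequent aspects and build a prompt from sample reviews."""
    counts = Counter(aspect for _, aspect, _ in pairs)
    top_aspects = [aspect for aspect, _ in counts.most_common(top_k)]

    # Keep at most `per_aspect` reviews per selected aspect (first seen wins).
    samples: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for review_id, aspect, sentiment in pairs:
        if aspect in top_aspects and len(samples[aspect]) < per_aspect:
            samples[aspect].append((sentiment, reviews[review_id]))

    lines = ["Summarize the customer feedback below, covering only the listed aspects."]
    for aspect in top_aspects:
        lines.append(f"Aspect: {aspect}")
        for sentiment, text in samples[aspect]:
            lines.append(f"  [{sentiment}] {text}")
    return top_aspects, "\n".join(lines)


# Hypothetical reviews and extracted pairs for one product.
reviews = {
    0: "Battery lasts long, screen is sharp.",
    1: "Battery died fast but the screen is nice.",
    2: "Great battery, though it costs too much.",
}
pairs = [
    (0, "battery", "positive"), (1, "battery", "negative"), (2, "battery", "positive"),
    (0, "screen", "positive"), (1, "screen", "positive"),
    (2, "price", "negative"),
]
top_aspects, prompt = build_summary_prompt(pairs, reviews)
```

Restricting the prompt to frequent aspects and quoting real reviews under each one is what keeps the generated summary grounded in actual customer feedback.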

End-to-End Aspect-Guided Review Summarization at Scale

The growing volume of daily disclosed software vulnerabilities imposes significant pressure on security analysts, extending the time needed for analysis, an essential step for accurate risk prioritization.
Meanwhile, the time between disclosure and exploitation is shrinking, becoming shorter than the analysis time and widening the window of opportunity for attackers.
This study explores leveraging Large Language Models (LLMs) for automating vulnerability risk score prediction using the industrial CVSS standard.
From our analysis across different data availability scenarios, LLMs can effectively complement supervised baselines in data-scarce settings. In the absence of any annotated data, such as during the transition to new versions of the standard, LLMs are the only viable approach, highlighting their value in improving vulnerability management.
We make the source code of AutoCVSS public.

AutoCVSS: Assessing the Performance of LLMs for Automated Software Vulnerability Scoring

In real-world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but the correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization for knowledge editing, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score.
Additionally, we release test sets from our post-edited data and terminology dictionary.

Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Retrieval-Augmented Generation (RAG) is a core approach for enhancing Large Language Models (LLMs), where the effectiveness of the retriever largely determines the overall response quality of RAG systems. Retrievers have many hyperparameters that significantly affect performance and are sensitive to the specific application. However, hyperparameter optimization is computationally expensive, and existing evaluation methods are either prohibitively costly or disconnected from domain-specific scenarios. This paper proposes SEARA (Subset sampling Evaluation for Automatic Retriever Assessment), which addresses evaluation data challenges through subset sampling techniques and achieves robust automated retriever evaluation through minimal retrieval fact extraction and comprehensive retrieval metrics. Based on real user queries, this method enables fully automated retriever evaluation at low cost, thereby obtaining the optimal retriever for specific business scenarios. We validate our method across classic RAG applications in rednote, including a knowledge-based Q&A system and a retrieval-based travel assistant, successfully obtaining scenario-specific optimal retrievers.
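To make the idea of low-cost evaluation on a sampled subset concrete, the sketch below scores a retriever with recall@k on a random sample of queries. It is a generic illustration, not SEARA itself. The uniform sampling, the choice of recall@k as the metric, and all names are assumptions, and the fact-extraction step mentioned in the abstract is not modeled.

```python
import random
from typing import Callable, Dict, List, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(set(relevant))


def evaluate_on_subset(
    queries: List[str],
    retriever: Callable[[str], List[str]],
    gold: Dict[str, List[str]],
    sample_size: int,
    k: int = 5,
    seed: int = 0,
) -> float:
    """Average recall@k over a uniformly sampled subset of the queries."""
    rng = random.Random(seed)  # fixed seed keeps the sampled subset reproducible
    subset = rng.sample(queries, min(sample_size, len(queries)))
    if not subset:
        return 0.0
    scores = [recall_at_k(retriever(q), gold[q], k) for q in subset]
    return sum(scores) / len(scores)
```

Running the same sampled subset against several retriever configurations gives a cheap, comparable score for each one, which is the kind of comparison needed to pick a scenario-specific retriever.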

SEARA: An Automated Approach for Obtaining Optimal Retrievers

Task-Oriented Dialogue (TOD) systems have become increasingly important for real-world applications, yet existing frameworks face significant challenges in handling unstructured information, providing multilingual support, and engaging proactively. We propose SMART (Scalable Multilingual Approach for a Robust TOD System), a novel TOD framework that effectively addresses these limitations. SMART combines traditional pipeline elements with modern agent-based approaches, featuring a simplified dialogue state, intelligent clarification mechanisms, and a unified natural language generation component that eliminates response redundancy. Through comprehensive evaluation on troubleshooting and medical domains, we demonstrate that SMART outperforms baseline systems across key metrics. The system's modular approach enables efficient scaling to new languages, as demonstrated with Spanish and Arabic. Integration of SMART in an e-commerce store resulted in a reduction in product return rates, highlighting its industry impact. Our results establish SMART as an effective approach for building robust, scalable TOD systems that meet real-world requirements.

SMART: Scalable Multilingual Approach for a Robust TOD System

This paper presents lessons learned from implementing Machine Translation (MT) systems in the context of a global medical technology company. We describe system challenges, legal and security considerations, and the critical role of human-in-the-loop validation for quality assurance and responsible deployment. Furthermore, based on an experiment involving over 11,000 ranked translations, we report reviewer preferences for outputs from small and large language models under various prompting configurations, using a domain-specific dataset spanning five language pairs.

Experience report: Implementing Machine Translation in a Regulated Industry

Enterprises, public organizations, and localization providers increasingly rely on Document-level Machine Translation (DocMT) to process contracts, reports, manuals, and multimedia transcripts across languages. However, existing MT systems often struggle to handle discourse-level phenomena such as pronoun resolution, lexical cohesion, and ellipsis, resulting in inconsistent or incoherent translations. We propose **GRAFT**, a modular graph-based DocMT framework that leverages Large Language Model (LLM) agents to segment documents into discourse units, infer inter-discourse dependencies, extract structured memory, and generate context-aware translations. GRAFT transforms documents into directed acyclic graphs (DAGs) to explicitly model translation flow and discourse structure. Experiments across eight language directions and six domains show GRAFT outperforms commercial systems (e.g., Google Translate) and closed LLMs (e.g., GPT-4) by an average of **2.8** d-BLEU, and improves terminology consistency and discourse handling. GRAFT supports deployment with open-source LLMs (e.g., LLaMA, Qwen), making it cost-effective and privacy-preserving. These results position GRAFT as a robust solution for scalable, document-level translation in real-world applications.
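The role of the DAG can be sketched as follows: discourse units are translated in topological order, so each unit sees the translations of the units it depends on. This is a minimal illustration under assumed data structures, not GRAFT's implementation. The `translate` callback stands in for an LLM call, and the unit ids and function name are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def translate_in_dag_order(
    units: Dict[str, str],
    deps: Dict[str, Set[str]],
    translate: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Translate discourse units so that every dependency is translated first.

    units: unit id -> source text
    deps:  unit id -> ids of units it depends on (all ids must exist in `units`)
    """
    graph = {uid: deps.get(uid, set()) for uid in units}
    translations: Dict[str, str] = {}
    # static_order() yields each node only after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for uid in TopologicalSorter(graph).static_order():
        context = [translations[d] for d in sorted(deps.get(uid, ()))]
        translations[uid] = translate(units[uid], context)
    return translations


# Stub translator: upper-cases the text and records how much context it received.
def stub_translate(text: str, context: List[str]) -> str:
    return f"{text.upper()}|ctx={len(context)}"


units = {"u1": "she arrived.", "u2": "it was late.", "u3": "they left."}
deps = {"u2": {"u1"}, "u3": {"u1", "u2"}}
result = translate_in_dag_order(units, deps, stub_translate)
```

Passing the already-translated dependencies as context is one way a later unit could resolve pronouns or reuse terminology consistently with earlier units.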
