China

The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their applications to compress pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.

EMNLP 2025

Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

low rank compression

efficient llm

tensor factorization

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation, by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns thus minimizing computational overhead. We empirically show this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently according to multiple automatic metrics in representative LLMs, establishing a strong baseline for document-level translation using LLMs.

Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents

Machine translation systems inevitably make translation errors. Studying the errors paves the important way towards building the error-free translation systems. Current fine-grained error analyses by LLMs gain more and more attention in machine translation, but these analyses do not ground the errors to the reasons why the annotated text spans are erroneous. In this paper, we evaluate whether LLMs really know such reasons when grounding the translation errors by manually building an evaluation resource through a bi-directional grounding scheme. In the forward direction, we annotate the explanation of the reason for each error span. In the backward direction, we annotate the error span given its explanation, in which the error span is masked. If the error spans of both directions are consistent, we deem the explanation is valid. Such grounding process can regulate the explanation so as to avoid the subjective bias. We evaluate LLMs grounding ability on this resource, and the results show that LLMs perform significantly worse than human in both directions. Furthermore, we apply the error grounding for filtering false alarmed errors, and achieve significant improvement in translation error detection.

An Evaluation Resource for Grounding Translation Errors

In this paper, we reported our experiments with various strategies to improve code-mixed humour and sarcasm detection. Particularly, we tried three approaches: (i) native sample mixing, (ii) multi-task learning (MTL), and (iii) prompting and instruction finetuning very large multilingual language models (VMLMs). In native sample mixing, we added monolingual task samples to code-mixed training sets. In MTL learning, we relied on native and code-mixed samples of a semantically related task (hate detection in our case). Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot context prompting and instruction finetuning. Some interesting findings we got are (i) adding native samples improved humor (raising the F1-score up to 6.76%) and sarcasm (raising the F1-score up to 8.64%) detection, (ii) training MLMs in an MTL framework boosted performance for both humour (raising the F1-score up to 10.67%) and sarcasm (increment up to 12.35% in F1-score) detection, and (iii) prompting and instruction finetuning VMLMs couldn't outperform the other approaches. Finally, our ablation studies and error analysis discovered the cases where our model is yet to improve. We provided our code for reproducibility.

Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection

Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units—cognitive atoms—extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem generation.

CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models

Empathy, a key communicative competence underlying human languages, has fueled debates about the link between language type and pragmatic competence. Despite this longstanding interest, the connection between language and empathy — especially the pragmatic transferability of empathy — remains unclear and underexplored in current NLP studies. Their over-reliance on English data leaves a gap in understanding empathy from a cross-lingual pragmatic perspective. Chinese languages, spoken by over one billion people in daily life, still lack human-produced recordings in terms of empathetic communication. In light of the Dual-Iceberg Model of ``Linguistic Interdependence Hypothesis'', human multilingualism builds on the shared pragmatic common ground among different languages. Inspired by the Dual-Iceberg hypothesis and the cognitive underpinning of human multilingualism, we integrate language-independent diffusion processes to probe and facilitate the language model's pragmatic transfer. Automatic and human evaluations demonstrate successful transfers of non-Chinese empathy into Chinese contexts without compromising linguistic naturalness. The results of this work demonstrate the cross-lingual shared pragmatics through the lens of large language models. It also provides a language-disentangled analytic framework to conduct inter-lingual comparisons of conversation in latent representation space and in any shared languages.

Facilitating Cross-lingual Transfer of Empathy through Language-independent Latent Diffusion: A Case Study in Chinese

Machine translation quality has steadily improved over the years, with some recent benchmarks indicating that machine translation models produce near-perfect translations. Such error-free outputs are not useful for distinguishing between models and assessing whether there is still room for improvement in the field. Being able to automatically create difficult test sets holds promise for developing more discriminative evaluations. Unfortunately, reliable methods for automatically estimating translation difficulty do not exist yet, and no previous research has conducted a broad investigation into which approaches are the most effective. In this work, we formalize the task of translation difficulty estimation, defining the difficulty of a text by the quality of its translations. We evaluate baseline and novel methods intrinsically (with a dedicated evaluation measure), and as a tool for constructing challenging machine translation benchmarks. Our experiments demonstrate that dedicated models vastly outperform both heuristic-based methods, such as word rarity and syntactic complexity, and LLM-as-a-Judge approaches. Practically, given a large collection of source texts, our difficulty estimators are able to select examples where machine translation models underperform.

Estimating Machine Translation Difficulty

While large language models (LLMs) excel at machine translation (MT), the impact of how LLMs utilize different forms of contextual information on discourse-level phenomena remains underexplored. We systematically investigate how different forms of context such as prior source sentences, models' generated hypotheses, and reference translations influence standard MT metrics and specific discourse phenomena (formality, pronoun selection, and lexical cohesion). Evaluating multiple LLMs across multiple domains and language pairs, our findings consistently show that context boosts both translation and discourse-specific performance. Notably, the context strategy of combining source text with the model's own prior hypotheses effectively improves discourse consistency without gold references, demonstrating effective use of model's own imperfect generations as diverse contextual cues.

Exploring Context Strategies in LLMs for Discourse-Aware Machine Translation

Word Meaning Negotiations (WMN) are sequences in conversation where speakers collectively discuss and shape word meaning. These exchanges can provide insight into conversational dynamics and word-related misunderstandings, but they are hard to find in corpora. In order to facilitate and speed up the data collection and annotation process, we introduce the task of detecting WMN indicators – utterances where a speaker signals the need to clarify or challenge word meaning. We train a wide range of models and reveal the difficulty of the task. Our models have better precision than previous regular-expression based approaches and show some generalization abilities, but have moderate recall. However, this constitutes a promising first step toward an iterative process for obtaining more data.

Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation

Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Temporal Coherence. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall’s τb, τc, and Spearman’s ρ. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues—a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question. We will release our dataset and code.

Downloads

Next from EMNLP 2025

Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES