The embodiment of emotional reactions in body parts carries rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, mainly comprising narrative-based descriptions focused on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias toward the facial region. Despite this limitation, we find that LVLMs can effectively recognize embodied emotions in face-masked images, outperforming naive baseline prompts. They achieve this without any fine-tuning when guided by the ELENA framework. ELENA charts a new trajectory for embodied-emotion analysis in the vision modality and enriches modeling in an affect-aware setting.