We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce [ANON], a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

EMNLP 2025

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

multilingual mt

human evaluation

automatic evaluation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors—including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 735 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

Social Bias in Multilingual Language Models: A Survey

Self-supervised speech models (S3Ms) can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon, which are often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.

Emergent morpho-phonological representations in self-supervised speech models

Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance.
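To make the triplet-based objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it uses the standard triplet margin loss, it assumes the anchor and positive are benign representations while the negatives are harmful ones, and it swaps in simple nearest-negative selection for the adversarial hard negative mining named in the abstract. The function names and the margin value are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hardest_negative(anchor, negatives):
    """Return the negative closest to the anchor (assumed mining rule)."""
    return min(negatives, key=lambda n: euclidean(anchor, n))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D "representations" (hypothetical values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negatives = [[3.0, 4.0], [1.0, 0.0]]

hard = hardest_negative(anchor, negatives)
loss = triplet_loss(anchor, positive, hard)
```

Selecting the closest negative matters because far-away negatives already satisfy the margin and contribute zero loss, so they provide no training signal.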

Improving Large Language Model Safety with Contrastive Representation Learning

Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time, by predicting future models. Because predicting a full model is prohibitively expensive in large-scale scenarios, recent TDG works only predict the classifier, but this limits generalization potential by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel TDG framework based on weight averaging that adjusts the entire model to maximize generalization potential while maintaining minimal computational overhead when scaling to large-scale datasets and models. Our theoretical analysis of weight averaging for TDG guided us to develop two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace and weighting experts based on their projected proximity to future domains in the subspace. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
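The weight-averaging step can be illustrated with a small sketch, under stated assumptions. It covers only the final averaging of expert parameters with given coefficients. How TEA derives those coefficients (trajectory modeling in a principal component subspace) is not reproduced here, and the function name, the dictionary layout, and the toy numbers are hypothetical.

```python
def average_experts(experts, coefficients):
    """Coefficient-weighted average of expert parameters.

    experts: list of dicts mapping parameter name -> float value.
    coefficients: one non-negative score per expert; normalized to sum to 1.
    """
    total = sum(coefficients)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    weights = [c / total for c in coefficients]
    names = experts[0].keys()
    return {
        name: sum(w * expert[name] for w, expert in zip(weights, experts))
        for name in names
    }

# Two toy experts fine-tuned on different temporal domains (made-up values).
experts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
# The second expert is assumed to be closer to the future domain.
merged = average_experts(experts, coefficients=[1.0, 3.0])
```

Because only one merged set of weights is kept, inference costs the same as running a single model, which is consistent with the efficiency claim in the abstract.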

Scaling Up Temporal Domain Generalization via Temporal Experts Averaging

Clinical notes contain rich patient information, such as diagnoses or medications, making them valuable for *patient representation learning*. Recent advances in large language models have further improved the ability to extract meaningful representations from clinical texts. However, clinical notes are often missing; for example, 35% of patients in real-world datasets lack them. In such cases, representations can be learned from other modalities such as structured data, chest X-rays, or radiology reports. Yet the availability of these modalities is influenced by clinical decision-making and varies across patients, resulting in modality missing-not-at-random (*MMNAR*) patterns. We propose a *causal representation learning* framework that leverages observed data and informative missingness in multimodal clinical records. It consists of: (1) an MMNAR-aware modality fusion module using large language models and other encoders to capture both patient health and reasons for missing data in representation learning; (2) a representation balancing module that improves generalization across missingness patterns, inspired by causal machine learning; and (3) a multitask prediction model, fine-tuned for each modality pattern using a rectifier to correct residual bias. On the MIMIC-IV dataset, our approach significantly outperforms recent baselines: AUC/APR increases by 16.83%/27.21% for hospital readmission, and by 6.86%/10.15% for ICU admission. Subgroup analyses confirm the value of modeling MMNAR for robust and generalizable clinical NLP.

Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness

Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answers, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.

XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering

As large language models (LLMs) are integrated into society, understanding their awareness of context is fundamental to ensuring safety and alignment. Past research has focused on situational awareness, examining LLMs' ability to recognize themselves and their circumstances, but the ability to recognize the conversational partner has been overlooked. In this study, we introduce interlocutor awareness, the ability of LLMs to recognize and adapt to the identity and capabilities of their conversational partners, and present the first systematic evaluation of this phenomenon. Specifically, we first assess the capability of LLMs to infer the identity of their interlocutor across three tasks: mathematical reasoning, code completion, and conversational inference. Subsequently, we evaluate behavioral adaptation through interlocutor awareness (where LLMs modify their behavior based on who they are interacting with) along two dimensions: collaborative adaptation, which assesses whether "sender" models tailor their explanations within controlled math-solving frameworks, and adversarial tactics, which examine how knowledge of the interlocutor's identity influences a model's success at jailbreaking. Our evaluation demonstrates that LLMs reliably identify same-family peers and tend to adapt their behavior based on the identity of their interaction partner. While our findings highlight the potential benefits of interlocutor awareness for optimizing multi-LLM collaboration, they also reveal novel risks related to AI safety and control.

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with limited availability of fact-checks and in the case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.

Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches

In the age of misinformation, hallucination, the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses, represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build an open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more and, significantly, that LLMs with broader language support display higher hallucination rates.
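The distinction between absolute hallucinated tokens and length-normalized rates can be shown with a short sketch. It assumes the rate is simply hallucinated tokens divided by generated tokens; the paper's exact definition is not given in the abstract, and the language labels and counts below are made up for illustration.

```python
def hallucination_rate(hallucinated_tokens, total_tokens):
    """Length-normalized rate: share of generated tokens flagged as hallucinated."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return hallucinated_tokens / total_tokens

# Hypothetical counts: (hallucinated tokens, total generated tokens).
counts = {
    "high_resource": (40, 400),  # longer responses, more hallucinated tokens
    "low_resource": (15, 100),   # shorter responses, fewer hallucinated tokens
}

rates = {lang: hallucination_rate(h, n) for lang, (h, n) in counts.items()}
```

In this toy case the high-resource language has more hallucinated tokens in absolute terms (40 versus 15) but a lower rate, which is why comparisons across languages need the normalized figure.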
