China

Certifying the robustness of Deep Neural Networks (DNNs) is crucial, especially with the rise of powerful generative models, such as Large Language Models (LLMs) or Vision-Language Models (VLMs), that have the potential of generating dangerous or harmful responses. Recent work has shown that these large-scale models are still susceptible to adversarial attacks, despite their safety fine-tuning. Randomized Smoothing (RS), the current state-of-the-art (SoTA) method for robustness certification, cannot be applied on models such as VLMs: first, RS is designed for classification, not generation. Second, RS is a probabilistic approach, typically requiring 10^5 samples to certify a single input, making it infeasible for large-scale modern VLMs. This is the challenge we aim to solve in this paper. First, we reformulate RS for the case of generative models, where we distinguish between harmless and harmful responses. Moreover, we develop a theory that allows us to reduce the number of samples required by 2-3 orders of magnitude, without much effect on the certified radius, and mathematically analyze its dependence to the number of samples. Combined, these advances allow us to scale RS on state-of-the-art VLMs, something that was not feasible before. We successfully showcase this experimentally by defending against a recent SoTA attack against aligned VLMs.

EMNLP 2025

Randomized Smoothing Meets Vision-Language Models

vision-language-models

randomized smoothing

adversarial robustness

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute—improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. This was shown to boost output quality in multiple settings for English. However, the question remains about how to best apply these methods across diverse languages and tasks. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both sampling strategy---based on temperature variation---and selection strategy must be adapted to account for language-specific characteristics. We evaluate existing and novel selection methods, revealing that strategies effective in English often fail to generalize across languages. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Ambiguity is pervasive in language, yet we resolve it effortlessly and unconsciously, often aided by context and part-of-speech (POS) cues. This study investigates how context similarity and POS influence homonym disambiguation in humans and large language models (LLMs). To enable comparable analyses between humans and LLMs, we first built an expert-curated sentence-pair dataset, manipulating context similarity and homonym POS categories. Participants (n = 55) and LLMs (via prompting) were asked to rate the sense similarity of target homonyms embedded within each sentence on a 7-point Likert scale. We found that context similarity influenced both groups similarly, but only humans utilized POS information, likely contributing to their superior performance. Model-derived metrics (surprisal, entropy) predicted human reaction times, and angular similarity between homonym representations accounted for additional variance, highlighting the roles of both expectation-based and semantic processes. Psycholinguistic factors like age of acquisition affected only human responses, underscoring distinct language acquisition mechanisms. Together, our findings illustrate how context and POS information interactively shape homonym resolution in humans, while exposing the limitations of current language models in capturing these nuanced processes. Dataset and codes are publicly available at https://anonymous.4open.science/r/context-and-pos-in-action-976D.

Context and POS in Action: A Comparative Study of Chinese Homonym Disambiguation in Human and Language Models

We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce [ANON], a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors—including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 735 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

Social Bias in Multilingual Language Models: A Survey

Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon—often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.

Emergent morpho-phonological representations in self-supervised speech models

Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance.

Improving Large Language Model Safety with Contrastive Representation Learning

Temporal Domain Generalization (TDG) aim at generalizing across temporal distribution shifts, e.g., lexical change over time, by predicting future models. Due to the prohibitive full model prediction cost on large-scale scenarios, recent TDG works only predict the classifier, but this limits generalization potential by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel TDG framework based on weight averaging that adjusts the entire model to maximize generalization potential while maintaining minimal computational overhead when scaling to large-scale datasets and models. Our theoretical analysis of weight averaging for TDG guided us to develop two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace and weighting experts based on their projected proximity to future domains in the subspace. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings reports TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.

Scaling Up Temporal Domain Generalization via Temporal Experts Averaging

Clinical notes contain rich patient information, such as diagnoses or medications, making them valuable for *patient representation learning*. Recent advances in large language models have further improved the ability to extract meaningful representations from clinical texts. However, clinical notes are often missing—for example, 35\% of patients in real-world datasets lack them. In such cases, representations can be learned from other modalities such as structured data, chest X-rays, or radiology reports. Yet the availability of these modalities is influenced by clinical decision-making and varies across patients, resulting in modality missing-not-at-random (*MMNAR*) patterns. We propose a *causal representation learning* framework that leverages observed data and informative missingness in multimodal clinical records. It consists of: (1) a MMNAR-aware modality fusion module using large language models and other encoders to capture both patient health and reasons for missing data in representation learning; (2) a representation balancing module that improves generalization across missingness patterns, inspired by causal machine learning; and (3) a multitask prediction model, fine-tuned for each modality pattern using a rectifier to correct residual bias. On the MIMIC-IV dataset, our approach significantly outperforms recent baselines: AUC/APR increases by 16.83\%/27.21\% for hospital readmission, and by 6.86\%/10.15\% for ICU admission. Subgroup analyses confirm the value of modeling MMNAR for robust and generalizable clinical NLP.

Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness

Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.

Downloads

Next from EMNLP 2025

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES