China

Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge of interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that even moderate sparsity ratios substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional cost. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.

EMNLP 2025

M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

one-shot pruning

llm compression

multilingual llms

poster
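The M-Wanda abstract above describes a pruning criterion built on activation statistics. As a rough illustration, the sketch below implements the baseline Wanda score that the method's name refers to: each weight is scored by its magnitude times the L2 norm of its input feature over a calibration set, and the lowest-scoring weights in each output row are zeroed. The function names `wanda_scores` and `prune_rows` are hypothetical, and the language-aware statistics and dynamic layerwise sparsity of M-Wanda are not reproduced here, since the abstract does not specify them.

```python
import math

def wanda_scores(weights, activations):
    """Score each weight as |W_ij| * ||X_j||_2 (baseline Wanda criterion).

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (num_samples, in_features)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across all calibration samples.
    col_norms = [
        math.sqrt(sum(sample[j] ** 2 for sample in activations))
        for j in range(n_in)
    ]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]

def prune_rows(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for w_row, s_row in zip(weights, scores):
        k = int(len(w_row) * sparsity)  # number of weights to drop per row
        drop = set(sorted(range(len(w_row)), key=lambda j: s_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Toy example: one output row, one calibration sample.
W = [[1.0, -0.5, 0.2, 3.0]]
X = [[0.1, 10.0, 1.0, 0.01]]
W_pruned = prune_rows(W, wanda_scores(W, X), sparsity=0.5)
```

In the toy example, the largest weight (3.0) is pruned because its input feature is almost never active, which is the behaviour that distinguishes activation-aware scoring from plain magnitude pruning.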

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Role-playing capabilities in large language models (LLMs) often lack cognitive consistency in complex scenarios that require deep understanding and coherent reasoning. While recent reasoning models excel in math and coding tasks, they show limited effectiveness in open-ended role-playing scenarios. We introduce R-CHAR (Role-Consistent Hierarchical Adaptive Reasoning), a metacognition-driven framework that enhances role-playing performance through guided thinking trajectories synthesis and adaptive evaluation. Our approach demonstrates that concise thinking processes can achieve superior performance efficiently compared to elaborate reasoning chains in role-playing social intelligence tasks, outperforming existing specialized models. Experimental results on the SocialBench benchmark show significant and stable performance improvements across varying scenario complexities, showing particular strength in long-context comprehension (from 34.64% to 68.59%) and group-level social interactions. Our work advances the development of cognitively consistent role-playing systems, bridging the gap between surface-level mimicry and authentic character simulation.

R-CHAR: A Metacognition-Driven Framework for Role-Playing in Large Language Models

Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.

Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training

Vision-language models (VLMs) are highly effective at semantic reasoning but struggle with a basic perceptual skill: recognizing hidden content in optical illusions and camouflaged images, which humans can perceive through simple adjustments like squinting or zooming. We introduce HC-Bench, a benchmark of over 1,200 images containing hidden text, objects, and illusions. Our evaluation across 11 state-of-the-art VLMs shows near-zero accuracy even when explicit prompts are provided, in stark contrast to human performance. Surprisingly, we find that downscaling the input image to a low resolution (32–128 pixels) restores model accuracy to over 99%. Additional experiments, including fine-tuning and image blurring, support the hypothesis that high-resolution inputs introduce redundant local features that interfere with global pattern recognition. These findings reveal a critical architectural blind spot in current VLMs and point toward the need for hybrid models with multi-scale visual processing. Our results have implications for applications in medical imaging, security, and other real-world settings that require robust visual understanding.

SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
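The TrojanStego abstract reports that majority voting across three generations raises recovery accuracy for 32-bit secrets. A minimal sketch of how such voting could work is shown below, assuming the vote is taken independently per bit over secrets that have already been decoded from each generation. The per-bit assumption and the function name `majority_vote_bits` are illustrative; the abstract does not specify the voting granularity or the decoding step, and the vocabulary-partitioning encoder is not reproduced.

```python
def majority_vote_bits(candidates: list[int], n_bits: int = 32) -> int:
    """Combine several decoded secrets by taking the majority value of each bit."""
    result = 0
    for bit in range(n_bits):
        ones = sum((c >> bit) & 1 for c in candidates)
        if 2 * ones > len(candidates):  # strict majority of candidates set this bit
            result |= 1 << bit
    return result

# Three noisy decodings of the same secret, each with a different flipped bit.
secret = 0xDEADBEEF
decoded = [secret ^ 0x1, secret ^ 0x80000000, secret ^ 0x400]
recovered = majority_vote_bits(decoded)
```

Because each bit error appears in only one of the three decodings, the vote recovers the original secret; errors that hit the same bit in two generations would not be corrected.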

Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.

CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings, respectively.

You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models

Certifying the robustness of Deep Neural Networks (DNNs) is crucial, especially with the rise of powerful generative models, such as Large Language Models (LLMs) or Vision-Language Models (VLMs), that have the potential to generate dangerous or harmful responses. Recent work has shown that these large-scale models are still susceptible to adversarial attacks, despite their safety fine-tuning. Randomized Smoothing (RS), the current state-of-the-art (SoTA) method for robustness certification, cannot be applied to models such as VLMs for two reasons. First, RS is designed for classification, not generation. Second, RS is a probabilistic approach, typically requiring 10^5 samples to certify a single input, making it infeasible for large-scale modern VLMs. This is the challenge we aim to solve in this paper. First, we reformulate RS for the case of generative models, where we distinguish between harmless and harmful responses. Moreover, we develop a theory that allows us to reduce the number of samples required by 2-3 orders of magnitude, without much effect on the certified radius, and mathematically analyze its dependence on the number of samples. Combined, these advances allow us to scale RS to state-of-the-art VLMs, something that was not feasible before. We successfully showcase this experimentally by defending against a recent SoTA attack against aligned VLMs.

Randomized Smoothing Meets Vision-Language Models
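To make the sample-count issue in the abstract above concrete, the sketch below computes a certified L2 radius in the standard classification setting of randomized smoothing, R = σ · Φ⁻¹(p_lower), where p_lower is a high-confidence lower bound on the probability of the majority outcome under Gaussian noise. This is not the paper's reformulation for generative models. For simplicity it swaps in a one-sided Hoeffding bound where the literature typically uses a Clopper-Pearson interval, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def hoeffding_lower_bound(successes: int, n: int, alpha: float = 0.001) -> float:
    """One-sided lower bound on a Bernoulli mean, valid with probability 1 - alpha."""
    p_hat = successes / n
    return max(0.0, p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n)))

def certified_radius(successes: int, n: int, sigma: float, alpha: float = 0.001) -> float:
    """Certified L2 radius sigma * Phi^{-1}(p_lower); 0.0 means 'abstain'."""
    p_lower = hoeffding_lower_bound(successes, n, alpha)
    if p_lower <= 0.5:
        return 0.0  # cannot certify that the majority outcome is stable
    return sigma * NormalDist().inv_cdf(p_lower)

# Same empirical rate (99% harmless), different numbers of noisy samples.
r_small = certified_radius(successes=990, n=1_000, sigma=0.5)
r_large = certified_radius(successes=99_000, n=100_000, sigma=0.5)
```

With the same empirical rate, the smaller sample yields a looser lower bound and therefore a smaller certified radius, which is the trade-off between sample count and radius that the abstract refers to.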

Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, that is, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel and select one of these as the final output. This was shown to boost output quality in multiple settings for English. However, the question remains of how best to apply these methods across diverse languages and tasks. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both the sampling strategy (based on temperature variation) and the selection strategy must be adapted to account for language-specific characteristics. We evaluate existing and novel selection methods, revealing that strategies effective in English often fail to generalize across languages. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
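The sample-then-select approach described in the abstract above can be sketched as a best-of-N loop: draw one candidate per temperature, score each candidate, and return the highest-scoring one. The `generate` and `score` callables below are hypothetical placeholders for a model call and a selection method; the abstract does not specify which selectors or temperature schedules the paper uses.

```python
from typing import Callable, Sequence

def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> text
    score: Callable[[str, str], float],     # hypothetical: (prompt, candidate) -> quality
    temperatures: Sequence[float],
) -> str:
    """Sample one candidate per temperature and return the highest-scoring one."""
    candidates = [generate(prompt, t) for t in temperatures]
    return max(candidates, key=lambda c: score(prompt, c))

# Deterministic stand-ins so the sketch runs without a model.
def fake_generate(prompt: str, temperature: float) -> str:
    return f"{prompt}@{temperature}"

def fake_score(prompt: str, candidate: str) -> float:
    # Pretend candidates sampled near temperature 0.7 are best.
    return -abs(float(candidate.split("@")[1]) - 0.7)

chosen = best_of_n("hello", fake_generate, fake_score, temperatures=[0.3, 0.7, 1.0])
```

In this framing, adapting to a language or task amounts to choosing a different temperature list or a different `score` function per setting.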

Ambiguity is pervasive in language, yet we resolve it effortlessly and unconsciously, often aided by context and part-of-speech (POS) cues. This study investigates how context similarity and POS influence homonym disambiguation in humans and large language models (LLMs). To enable comparable analyses between humans and LLMs, we first built an expert-curated sentence-pair dataset, manipulating context similarity and homonym POS categories. Participants (n = 55) and LLMs (via prompting) were asked to rate the sense similarity of target homonyms embedded within each sentence on a 7-point Likert scale. We found that context similarity influenced both groups similarly, but only humans utilized POS information, likely contributing to their superior performance. Model-derived metrics (surprisal, entropy) predicted human reaction times, and angular similarity between homonym representations accounted for additional variance, highlighting the roles of both expectation-based and semantic processes. Psycholinguistic factors like age of acquisition affected only human responses, underscoring distinct language acquisition mechanisms. Together, our findings illustrate how context and POS information interactively shape homonym resolution in humans, while exposing the limitations of current language models in capturing these nuanced processes. The dataset and code are publicly available at https://anonymous.4open.science/r/context-and-pos-in-action-976D.

Context and POS in Action: A Comparative Study of Chinese Homonym Disambiguation in Human and Language Models

We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce [ANON], a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
