China

Jailbreaking, the phenomenon where specific prompts cause LLMs to assist with harmful requests, remains a critical challenge in NLP, particularly in non-English and lower-resourced languages. To address this, we introduce MULBERE, which extends Targeted Latent Adversarial Training (T-LAT) to a multilingual context. We first create and share a multilingual jailbreak dataset spanning high-, medium-, and low-resource languages, and then fine-tune LLaMA-2-7b-chat, interleaving T-LAT (for jailbreak robustness) with chat examples (to preserve general model performance). Our evaluations show that MULBERE reduces average multilingual jailbreak success rates by 75% compared to the base LLaMA safety training and by 71% compared to English-only T-LAT, while maintaining or improving standard LLM performance.

EMNLP 2025

MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training

latent adversarial training

jailbreak

low-resource languages

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Language models perpetuate dialect bias, associating African American English (AAE) with negative traits and outcomes. We propose JustDial, a lightweight finetuning framework that aligns character trait associations between meaning-matched AAE and Standardized American English (SAE) text, while preserving general model fluency through a KL-divergence regularization term. Experiments on GPT2-Medium show that JustDial removes any statistically significant correlation between dialect and predicted occupational prestige and reduces conviction and death-sentencing disparities by more than 98.7%, using only 100,000 text examples and one epoch of LoRA finetuning. Though this debiasing comes at the cost of general model performance, adjusting the regularization term in JustDial enables a navigable debiasing-performance tradeoff space. JustDial provides the first proof of concept towards mitigating dialect prejudice in language models.

JustDial: Language Model Dialect Debiasing Using Biased Character Trait Associations
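The debiasing-performance tradeoff described in the JustDial abstract can be illustrated with a toy objective. This is only a sketch under stated assumptions, not the authors' loss: the squared-difference alignment term, the KL direction, and the names `debias_loss` and `kl_weight` are all hypothetical, since the abstract says only that trait associations are aligned and that a KL-divergence term regularizes the model.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def debias_loss(trait_probs_aae, trait_probs_sae, tuned_dist, base_dist, kl_weight=0.1):
    """Toy objective: alignment term plus a weighted KL regularizer.

    Assumptions (not from the paper): the alignment term is the mean squared
    difference between trait probabilities for meaning-matched AAE and SAE
    inputs, and the regularizer is KL(tuned || base) over token distributions.
    """
    alignment = sum((a - s) ** 2 for a, s in zip(trait_probs_aae, trait_probs_sae))
    alignment /= len(trait_probs_aae)
    return alignment + kl_weight * kl_divergence(tuned_dist, base_dist)

# Hypothetical numbers: trait probabilities for one AAE/SAE pair, and the
# tuned vs. base model distributions over a two-token vocabulary.
loss_small_reg = debias_loss([0.8, 0.2], [0.5, 0.5], [0.7, 0.3], [0.5, 0.5], kl_weight=0.1)
loss_large_reg = debias_loss([0.8, 0.2], [0.5, 0.5], [0.7, 0.3], [0.5, 0.5], kl_weight=1.0)
```

Raising `kl_weight` penalizes drift from the base model more heavily, which is one way to picture how the regularization term moves a model along the tradeoff.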

Multilingual Pre-trained Language Models (multiPLMs) trained with the Masked Language Modeling (MLM) objective exhibit suboptimal performance on cross-lingual downstream tasks for Low-Resource Languages (LRLs). Continually pre-training these multiPLMs with the Translation Language Modeling (TLM) objective on parallel data improves cross-lingual performance. However, both MLM and TLM mask tokens randomly, which does not guarantee optimal representation learning. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to three types of linguistic entities (nouns, verbs, and named entities), which hold a higher prominence in a sentence. We hypothesise that masking these linguistically significant entities contributes to more effective representation learning. We test this hypothesis empirically on two downstream tasks with three LRL pairs (English-Sinhala, English-Tamil, and Sinhala-Tamil) and show that LEM-based learning returns superior results compared to MLM+TLM.

Linguistic Entity Masking to improve Cross-Lingual representations in encoder-based LLMs
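The core idea of the LEM abstract, restricting masking to nouns, verbs, and named entities, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names, the `mask_rate` budget, and the function `linguistic_entity_mask` are assumptions, and the tags are taken as already produced by some POS/NER tagger.

```python
import random

# Coarse tags treated as maskable "linguistic entities" (assumed tag names).
MASKABLE_TAGS = {"NOUN", "VERB", "NE"}

def linguistic_entity_mask(tokens, tags, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Mask roughly mask_rate of the tokens, drawing only from maskable tags.

    Returns the masked sequence and per-position labels (the original token
    where a mask was applied, None elsewhere).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    rng = random.Random(seed)
    candidates = [i for i, tag in enumerate(tags) if tag in MASKABLE_TAGS]
    # Budget is an assumption: at least one mask whenever a candidate exists.
    budget = max(1, round(mask_rate * len(tokens))) if candidates else 0
    chosen = set(rng.sample(candidates, min(budget, len(candidates))))
    masked = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = ["Nimal", "reads", "a", "book", "in", "Colombo"]
tags = ["NE", "VERB", "DET", "NOUN", "ADP", "NE"]
masked, labels = linguistic_entity_mask(tokens, tags, mask_rate=0.5, seed=0)
```

Function words such as "a" and "in" are never selected, whereas random MLM/TLM masking could spend part of its budget on them.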

Our desires often influence our beliefs and expectations. Humans tend to think good things are more likely to happen than they actually are, while believing bad things are less likely. This tendency has been referred to as wishful thinking in research on coping strategies. With large language models (LLMs) increasingly being considered as computational models of human cognition, we investigate whether they can simulate this distinctly human bias. We conducted two systematic experiments across multiple LLMs, manipulating outcome desirability and information uncertainty across multiple scenarios including probability games, natural disasters, and sports events. Our experiments revealed limited wishful thinking in LLMs. In Experiment 1, only two models showed the bias, and only in sports-related scenarios when role-playing characters. Models exhibited no wishful thinking in mathematical contexts. Experiment 2 found that explicit prompting about emotional states (being hopeful) was necessary to elicit wishful thinking in logical domains. These findings reveal a significant gap between human cognitive biases and LLMs' default behavior patterns, suggesting that current models require explicit guidance to simulate wishful thinking influences on belief formation.

Investigating Motivated Inference in Large Language Models

We introduce a multilingual benchmark for evaluating large language models (LLMs) on hate speech detection and generation in low-resource Ethiopian languages: Afaan Oromo, Amharic and Tigrigna, and English (both monolingual and code-mixed). Using a balanced and expert-annotated dataset, we assess five state-of-the-art LLM families across both tasks. Our results show that while LLMs perform well on English detection, their performance on low-resource languages is significantly weaker, revealing that increasing model size alone does not ensure multilingual robustness. More critically, we find that all models, including closed and open-source variants, can be prompted to generate profiled hate speech with minimal resistance. These findings underscore the dual risk of exclusion and exploitation: LLMs fail to protect low-resource communities while enabling scalable harm against them. We make our evaluation framework available to facilitate future research on multilingual model safety and ethical robustness.

Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages

We investigate how Visual Language Models (VLMs) leverage visual features when making analogical comparisons about people. Using synthetic images of individuals varying in skin tone and nationality, we prompt GPT and Gemini models to make analogical associations with desserts and drinks. Results reveal that VLMs systematically associate darker-skinned individuals with brown-colored food items, with GPT showing stronger associations than Gemini. These biases are amplified in Thai versus English prompts, suggesting language-dependent encoding of visual stereotypes. The associations persist across manipulation checks including position swapping and clothing changes, though presenting individuals alone yields divergent language-specific patterns. This work reveals concerning biases in VLMs' visual reasoning that vary by language, with important implications for multilingual deployment.

Brown Like Chocolate: How Vision-Language Models Associate Skin Tone with Food Colors

Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse information retrieval benchmarks that include low-resource languages, but their results on those languages are not on par with their results on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; we selected the model because it outperforms other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune it, we curated a dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented the dataset with English queries mapped to each of the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. The fine-tuned model improves mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English-to-local retrieval, surpassing the baseline BGE-M3 scores of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. The resulting model supports multilingual search, question answering, and other local semantic applications. We release the final dataset, scraping and processing scripts, and fine-tuned model weights.

Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages
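For readers unfamiliar with the metric reported in the BGE-M3 abstract, mean reciprocal rank can be computed as in the sketch below. It assumes one relevant document per query and no rank cutoff; the abstract does not state the exact evaluation protocol, and the document ids are made up for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant_ids):
    """Average of 1/rank of the relevant document over all queries.

    ranked_results: one ranked list of document ids per query.
    relevant_ids: the single relevant document id for each query (assumption).
    A query whose relevant document is not retrieved contributes 0.
    """
    if not ranked_results or len(ranked_results) != len(relevant_ids):
        raise ValueError("need one relevant id per non-empty list of rankings")
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Three toy queries: relevant document at rank 1, at rank 2, and not retrieved.
rankings = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d4", "d0"]
mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 1/2 + 0) / 3 = 0.5
```

An MRR of 0.92, as reported for Yorùbá, therefore means the relevant document sits at or very near the top of the ranking for most queries.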

Sandhi, the phonological merging of morphemes, is a central feature of Sanskrit grammar. While Sandhi formation is well-defined by Pāṇini’s Aṣṭādhyāyī, the reverse task—Sandhi splitting—is substantially more complex due to inherent ambiguity and context-sensitive transformations. Accurate splitting is a critical precursor to tokenization in Sanskrit, which lacks explicit word boundaries and presents densely fused compounds. In this work, we present a data-driven approach, fine-tuning the Gemma-3 4B large language model on a dataset of over 49,000 training and 2,000 test examples of compound words and their morpheme-level decompositions. Leveraging the Unsloth framework with low-rank adaptation (LoRA) and 4-bit quantization, we train the model to predict these splits. Our work yields a scalable, Sandhi-aware system designed to enhance modern NLP pipelines for classical Sanskrit, demonstrating an effective application of LLMs to this linguistic challenge.

The Gemma Sutras: Fine-Tuning Gemma 3 for Sanskrit Sandhi Splitting

Large Language Models (LLMs) with agentic capabilities can carry out knowledge-intensive tasks without human involvement. A prime example is Deep Research, a tool that can browse the web, extract information, and generate multi-page reports.
In this work, we introduce an evaluation sheet for assessing the capabilities of Deep Research tools. We select academic survey writing as a use-case task and evaluate the output reports against this evaluation sheet. Our findings show the need for carefully crafted evaluation standards. Our evaluation of OpenAI’s Deep Search and Google’s Deep Search on generating an academic survey shows a large gap between search engines and standalone Deep Research tools, as well as shortcomings in how well the reports represent the targeted area.

Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
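One simple way to combine several judges, as the Reference-Guided Verdict abstract describes, is a majority vote over per-judge verdicts. The sketch below is an assumption-laden illustration: the abstract does not say how verdicts are combined, and the three judges here are plain string heuristics standing in for LLM calls that would each see the question, the reference answer, and the candidate answer.

```python
def aggregate_verdicts(verdicts):
    """Strict majority vote over boolean verdicts; a tie counts as incorrect."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return 2 * sum(verdicts) > len(verdicts)

def judge_answer(question, reference, candidate, judges):
    """Ask every judge for a verdict on the candidate and combine the results."""
    return aggregate_verdicts([bool(j(question, reference, candidate)) for j in judges])

# Hypothetical stand-in judges; real ones would prompt different LLMs.
def exact_judge(question, reference, candidate):
    return candidate.strip().lower() == reference.strip().lower()

def containment_judge(question, reference, candidate):
    return reference.strip().lower() in candidate.lower()

def token_overlap_judge(question, reference, candidate):
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    return bool(ref) and len(ref & cand) / len(ref) >= 0.5

judges = [exact_judge, containment_judge, token_overlap_judge]
accepted = judge_answer("What is the capital of France?", "Paris", "The capital is Paris", judges)
rejected = judge_answer("What is the capital of France?", "Paris", "Lyon", judges)
```

The first candidate fails exact match but is accepted by the other two judges, which mirrors the abstract's point that exact-match style metrics alone miss correct free-form answers.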

Natural language explanations (NLEs) are widely used to communicate model reasoning to humans, but they may also serve as effective control signals for improving model performance. In this paper, we present the first comprehensive evaluation of NLEs as prompts in in-context learning (ICL), comparing human-annotated, self-generated, and LLM-generated NLEs across five reasoning benchmarks and three instruction-tuned models (Llama 3 8B, Llama 3 70B, GPT-4o-mini). Our preliminary results show that LLM-generated explanations, especially those from GPT-4o-mini, yield the highest gains across tasks. We further plan to measure how strongly the faithfulness of self-explanations correlates with their utility, and whether models retain partial robustness even when rationales are randomly mismatched or adversarially swapped.
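Using NLEs as prompts in ICL amounts to placing an explanation alongside each demonstration. The sketch below shows one possible prompt layout; the field names, the explanation-before-answer ordering, and the function `build_icl_prompt` are assumptions for illustration, not the template used in the paper.

```python
def build_icl_prompt(demos, query, use_explanations=True):
    """Build a few-shot prompt, optionally including an explanation per demo.

    demos: list of dicts with 'question', 'explanation', and 'answer' keys
    (a hypothetical schema). The explanation may be human-annotated,
    self-generated, or LLM-generated; the layout is the same in each case.
    """
    blocks = []
    for demo in demos:
        lines = [f"Question: {demo['question']}"]
        if use_explanations:
            lines.append(f"Explanation: {demo['explanation']}")
        lines.append(f"Answer: {demo['answer']}")
        blocks.append("\n".join(lines))
    # End with the open slot the model is expected to complete.
    slot = "Explanation:" if use_explanations else "Answer:"
    blocks.append(f"Question: {query}\n{slot}")
    return "\n\n".join(blocks)

demos = [{
    "question": "If all birds can fly and a robin is a bird, can a robin fly?",
    "explanation": "The premise covers every bird, and a robin is a bird.",
    "answer": "Yes",
}]
with_nle = build_icl_prompt(demos, "Is 17 a prime number?")
without_nle = build_icl_prompt(demos, "Is 17 a prime number?", use_explanations=False)
```

Swapping the `explanation` field between sources while holding the demonstrations fixed is one way to set up the comparison the abstract describes.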
