Warning: This paper contains explicit statements of offensive stereotypes that may be upsetting.
Stereotypes vary across cultural contexts, making it essential to understand how language models encode social biases. MultiLingualCrowsPairs is a dataset of culturally adapted stereotypical and anti-stereotypical sentence pairs across nine languages. While prior work has primarily reported average fairness metrics on masked language models, this paper analyzes social biases in generative models by disaggregating results across specific bias types.
We find that although most languages show an overall preference for stereotypical sentences, this masks substantial variation across different types of bias, such as gender, religion, and socioeconomic status. Our findings underscore that relying solely on aggregated metrics can obscure important patterns, and that fine-grained, bias-specific analysis is critical for meaningful fairness evaluation.
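The core claim, that an aggregate preference rate can hide per-category variation, can be illustrated with a minimal sketch. The scores and helper below are hypothetical (illustrative numbers, not the paper's data or its actual metric implementation): each pair records a bias type plus the model's log-probability for the stereotypical and anti-stereotypical sentence, and the metric is the fraction of pairs where the stereotypical sentence scores higher.

```python
from collections import defaultdict

# Hypothetical scored pairs: (bias_type, log-prob of stereotypical
# sentence, log-prob of anti-stereotypical sentence). Values are
# made up for illustration only.
scored_pairs = [
    ("gender", -41.2, -43.0),
    ("gender", -38.5, -37.9),
    ("religion", -50.1, -52.4),
    ("religion", -47.3, -49.0),
    ("socioeconomic", -44.0, -43.2),
    ("socioeconomic", -45.6, -46.1),
]

def stereotype_rate(pairs):
    """Fraction of pairs where the model assigns higher probability
    to the stereotypical sentence (0.5 = no preference)."""
    preferred = sum(1 for _, stereo, anti in pairs if stereo > anti)
    return preferred / len(pairs)

# Aggregate metric over all pairs.
overall = stereotype_rate(scored_pairs)

# Disaggregated: the same metric computed per bias type.
grouped = defaultdict(list)
for bias_type, stereo, anti in scored_pairs:
    grouped[bias_type].append((bias_type, stereo, anti))
per_type = {t: stereotype_rate(p) for t, p in grouped.items()}
```

With these toy numbers the overall rate is about 0.67, yet the per-type rates range from 0.5 (no preference) to 1.0, which is exactly the kind of pattern a single aggregated number would obscure.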
