China

Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

EMNLP 2025

Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning

symbolic shortcuts

cultural bias

safety-critical


workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Abstract Meaning Representation (AMR) is a graph-based semantic representation that has been incorporated into numerous downstream tasks, in particular due to substantial efforts developing text-to-AMR parsing and AMR-to-text generation models. However, there still exists a large gap between fluent, natural sentences and texts generated from AMR-to-text generation models. Prompt-based Large Language Models (LLMs), on the other hand, have demonstrated an outstanding ability to produce fluent text in a variety of languages and domains. In this paper, we investigate the extent to which LLMs can improve the AMR-to-text generated output fluency post-hoc via prompt engineering. We conduct automatic and human evaluations of the results, and ultimately have mixed findings: LLM-generated paraphrases generally do not exhibit improvement in automatic evaluation, but outperform baseline texts according to our human evaluation. Thus, we provide a detailed error analysis of our results to investigate the complex nature of generating highly fluent text from semantic representations.

GPT4AMR: Does LLM-based Paraphrasing Improve AMR-to-text Generation Fluency?

Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.

Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian

An automatic court hearing transcription system is being developed for the Federal Supreme Court of Ethiopia to address the challenges of manual transcription. Using Automatic Speech Recognition (ASR) technology, the system aims to transcribe Amharic-language court recordings accurately and efficiently. This solution not only improves the court system but also safeguards the health of transcribers and enhances the overall speed and quality of legal proceedings in Ethiopia. In this study, a self-supervised, Transformer-based Wav2Vec 2.0 approach was used to build the ASR system. With a dataset comprising over 500 hours of unlabeled data, the system achieved a Word Error Rate (WER) of 14.36%, showing that it can transcribe court proceedings with high accuracy.

Wav2Vec-Based Self-Supervised Learning for Court Hearing Transcription

Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a teacher professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset's diversity in terms of both the types of biases represented and the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets, potentially making bias mitigation by self-debiasing harder. HESEIA is available to support bias assessments grounded in educational communities.

An intersectional bias evaluation dataset grounded in educational contexts

In this study, we investigate how author affiliation shapes academic discourse, proposing it as an effective proxy for author perspective in understanding what topics are studied, how nations are framed, and whose realities are prioritised. Using Palestine as a case study, we apply BERTopic and Structural Topic Modelling (STM) to 29,536 English-language academic articles collected from the OpenAlex database. We find that domestic authors focus on practical, local issues like healthcare, education, and the environment, while foreign authors emphasise legal, historical, and geopolitical discussions. These differences, in our interpretation, reflect lived proximity to war and crisis. We also note that while BERTopic captures greater lexical nuance, STM enables covariate-aware comparisons, offering deeper insight into how affiliation correlates with thematic emphasis. We propose extending this framework to other underrepresented countries, including a future study focused on Gaza post-October 7.

Whose Palestine Is It? A Topic Modelling Approach to National Framing in Academic Research

Named Entity Recognition (NER) is the information extraction task of identifying predefined named entities such as person names, location names, organization names and more. High-resource languages have seen significant progress in NER. However, low-resource languages such as Kurmanji Kurdish have not seen the same advancements, because less data is available online for them. This research aims to close this gap by developing an NER system via fine-tuning XLM-RoBERTa on a manually annotated dataset for Kurmanji. The dataset used for fine-tuning consists of 7,919 sentences annotated by three native Kurmanji speakers. The classes labeled in the dataset are Person (PER), Organization (ORG), and Location (LOC). A web-based application has also been developed using Streamlit to make the model more accessible. The model achieved an F1 score of 0.8735, precision of 0.8668, and recall of 0.8803, demonstrating the effectiveness of fine-tuning transformer-based models for NER tasks in low-resource languages. This work establishes a methodology that can be applied to other low-resource languages and Kurdish varieties.

Fine-tuning XLM-RoBERTa for Named Entity Recognition in Kurmanji Kurdish

As Large Language Models (LLMs) are deployed in every aspect of our lives, understanding how they reason about moral issues becomes critical for AI safety. We investigate this using a dataset we curated from Reddit's r/AmItheAsshole, comprising real-world moral dilemmas with crowd-sourced verdicts. Through experiments on five state-of-the-art LLMs across 847 posts, we find a significant and systematic divergence where LLMs are more lenient than humans. Moreover, we find that translating the posts into another language changes LLMs' verdicts, indicating their judgments lack cross-lingual stability.

Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis

The Nüshu script, originating from Jiangyong County, China, is the world’s only known writing system historically created and used exclusively by women. Although Natural Language Processing (NLP) efforts have begun digitizing limited Nüshu-Chinese text pairs, computational access to the script remains highly restricted due to its handwritten, visual nature and absence of multimodal tools. We contribute two novel datasets: NüshuVision, an image corpus of 500 rendered sentences in traditional vertical, right-to-left orthography, and NüshuStrokes, the first sequential handwriting recordings of all 397 Unicode Nüshu characters by an expert calligrapher. Benchmarking five leading Chinese OCR systems on NüshuVision shows a consistent Character Error Rate (CER) of 1.0. Fine-tuning Microsoft’s TrOCR model reduces CER to 0.67. These resources mark a crucial step toward multimodal processing of Nüshu and present a new paradigm for culturally sensitive language revitalization.

Revitalizing Nüshu Through Mixed Media

This paper focuses on data-driven dependency parsing for Vedic Sanskrit. We propose and evaluate a transfer learning approach that benefits from syntactic analysis of typologically related languages, including Ancient Greek and Latin, and a descendant language, Classical Sanskrit. Experiments on the Vedic TreeBank demonstrate the effectiveness of cross-lingual transfer: the approach improves on the biaffine baseline and outperforms the current state-of-the-art benchmark, the deep contextualised self-training algorithm, across a wide range of experimental setups.

Transfer learning for dependency parsing of Vedic Sanskrit

Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape—rich with indirect expressions, polarized figures, and sentiment-stance entanglement—LLMs often exhibit systematic biases, including sentiment leakage and entity favoritism. These biases not only compromise model fairness but also degrade predictive reliability in real-world applications. We introduce ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without fine-tuning LLMs. ThaiFACTUAL combines counterfactual data augmentation with rationale-based supervision to disentangle sentiment from stance and neutralize political preferences. We curate and release the first high-quality Thai political stance dataset with stance, sentiment, rationale, and bias markers across diverse political entities and events. Our results show that ThaiFACTUAL substantially reduces spurious correlations, improves zero-shot generalization, and enhances fairness across multiple LLMs. This work underscores the need for culturally grounded bias mitigation and offers a scalable blueprint for debiasing LLMs in politically sensitive, underrepresented languages.

