The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination, where part or all of a benchmark's data appears in a model's training set, poses critical challenges for fair evaluation. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs because of the complexity of multimodal data and the multi-phase training pipeline. We systematically analyze multimodal data contamination with our analytical framework, MM-DETECT, which defines two contamination categories, unimodal and cross-modal, and quantifies contamination severity on multiple-choice and caption-based Visual Question Answering (VQA) tasks. Evaluations of twelve MLLMs on five benchmarks reveal significant contamination, particularly in proprietary models and on older benchmarks. Crucially, contamination can originate during unimodal pre-training rather than solely from multimodal fine-tuning. These insights refine our understanding of contamination, inform evaluation practice, and improve the reliability of multimodal models.
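To make the multiple-choice setting concrete, below is a minimal sketch of one generic perturbation-based contamination probe: compare a model's accuracy with the original answer options against its accuracy after the options are shuffled, since a large drop suggests dependence on memorized option order rather than visual understanding. This is an illustrative heuristic in the spirit of the tasks the abstract names, not necessarily MM-DETECT's exact procedure, and the `model.answer(...)` interface is an assumption for the sketch.

```python
import random


def option_sensitivity_rate(model, dataset, seed=0):
    """Hypothetical contamination probe for multiple-choice VQA.

    Compares accuracy with the original option order against accuracy
    after shuffling the options. A large `sensitivity_gap` hints that
    the model may have memorized the benchmark's surface form, i.e.
    possible contamination. `model.answer(image, question, options)`
    is an assumed interface returning the chosen option string.
    """
    rng = random.Random(seed)
    correct_original = 0
    correct_shuffled = 0
    for ex in dataset:  # each ex: dict with image, question, options, answer
        # Query with the options exactly as the benchmark presents them.
        pred = model.answer(ex["image"], ex["question"], ex["options"])
        correct_original += pred == ex["answer"]

        # Query again with the same options in a random order.
        shuffled = ex["options"][:]
        rng.shuffle(shuffled)
        pred = model.answer(ex["image"], ex["question"], shuffled)
        correct_shuffled += pred == ex["answer"]

    n = len(dataset)
    return {
        "acc_original": correct_original / n,
        "acc_shuffled": correct_shuffled / n,
        "sensitivity_gap": (correct_original - correct_shuffled) / n,
    }
```

Under this kind of probe, an uncontaminated model should score roughly the same in both conditions, so the gap, aggregated over a benchmark, serves as a simple severity signal of the sort the framework is designed to quantify.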