China

Prompt-injection and jailbreak attacks can coerce large language models (LLMs) into revealing system prompts or producing unsafe content, threatening real-world deployments. We present Proxy Barrier (ProB), a lightweight defense that interposes a repeater proxy LLM between the user and the target model. The repeater acts to verbatim-echo benign user input, and any divergence indicates adversarial tampering, causing the request to be dropped before it reaches the target model, so that attempts to bypass safety boundaries are blocked. ProB therefore requires no access to model weights or prompts, is model-agnostic, and deployable entirely at the API level. Experiments across multiple model families demonstrate that ProB achieves state-of-the-art resilience against prompt leakage and jailbreak attacks. Notably, our approach achieves up to 98.8% improvement in defense effectiveness over baselines, and shows robust protection across both open and closed-source LLMs when suitably paired with proxy models.

EMNLP 2025

Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking

guardrail

artificial intelligence

safety

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

We introduce AraSafe, the first native Arabic large-scale safety benchmark for large language models (LLMs), addressing the pressing need for culturally and linguistically representative evaluation resources. The dataset consists of 12K naturally occurring, human-written Arabic prompts spanning diverse domains, such as linguistics, social studies, sciences, and safety. Each prompt was independently annotated by two expert annotators into one of nine fine-grained safety categories, including 'Illegal Activities', 'Violence or Harm', 'Privacy Violation', and 'Hate Speech'. To enrich the representation of harmful content, we augmented the dataset with 12K synthetic harmful prompts generated using GPT-4o via carefully designed prompt engineering techniques. We benchmark a number of Arabic-centric and multilingual models in the 7–13B parameter range, including Jais, AceGPT, Allam, Fanar, Llama-3, Gemma-2, and Qwen3, as well as BERT-based fine-tuned models. GPT-4o was used as an upper-bound reference baseline. Our evaluation reveals critical safety blind spots in Arabic LLMs and underscores the necessity of localized, culturally grounded benchmarks for building responsible AI systems.

AraSafe: Benchmarking Safety in Arabic LLMs

Recently, advancements in AI counseling based on large language models have shown significant progress. However, existing studies employ a one-time generation approach to synthesize multi-turn dialogue samples, resulting in low therapy fidelity and a lack of deeper understanding of the therapeutic logic underlying each response turn. In this work, we propose a novel data synthesis framework, CATCH, designed to address the challenges above. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy to systematically derive key elements from the client's self-report and organize them into structured outlines, thereby generating a counseling dialogue dataset aligned with therapeutic intervention principles. To enhance therapy logic, we propose the Memory-Driven Dynamic Planning thinking pattern, which clarifies decision-making motivations for each dialogue turn. This pattern incorporates memory enhancement, global planning, and strategy reasoning, enabling the development of a collaborative multi-agent iterative optimization method that synthesizes complete chains of thought in counseling dialogues. Extensive experiments demonstrate that CATCH significantly enhances the fidelity and logical coherence in AI counseling.

CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling

The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination — partial/entire benchmark data is included in the model's training set — poses critical challenges for fair evaluation. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training. We systematically analyze multimodal data contamination using our analytical framework, MM-DETECT, which defines two contamination categories — unimodal and cross-modal — and effectively quantifies contamination severity across multiple-choice and caption-based Visual Question Answering tasks. Evaluations on twelve MLLMs and five benchmarks reveal significant contamination, particularly in proprietary models and older benchmarks. Crucially, contamination sometimes originates during unimodal pre-training rather than solely from multimodal fine-tuning. Our insights refine contamination understanding, guiding evaluation practices and improving multimodal model reliability.

Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM

Recent trends in LLMs development clearly show growing interests in use and application of Sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users’ socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential risk. To address this gap, we collect a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always follow the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Personality drift in Large Language Models (LLMs) poses a textitsafety and alignment risk: a model that shifts traits across a dialogue can produce inconsistent or harmful behaviours. Yet most existing psychometric evaluations probe LLMs in a context‑free vacuum, answering each item in isolation—what we call the textitDisney World test. We introduce the first textitcontext‑aware framework that transforms conversational history into a rigorous stress test for persona stability. Our method (i) simulates realistic multi‑turn interactions, (ii) proposes new consistency metrics to quantify alignment‑critical trait drift, and (iii) red‑teams models via prompt inconsistency factors. Across 7 frontier and open LLMs, conversational context boosts answer consistency through in‑context learning but also triggers notable personality shifts: textttGPT‑3.5‑Turbo and textttGPT‑4‑Turbo show the most extreme deviations. We find that textttGPT models remain robust to question ordering, whereas textttGemini‑1.5‑Flash and textttLlama‑3.1‑8B are highly order‑sensitive. Our causal analysis suggests textttGPT responses blend intrinsic persona signals with conversational cues, while textttGemini‑1.5‑Flash and textttLlama‑3.1‑8B rely predominantly on recent context, a potential vulnerability for adversarial steering. We further validate on Role‑Playing Agents, demonstrating that context‑aware alignment yields responses rated more consistent and human‑aligned. Our open‑sourced toolkit enables practitioners to diagnose and monitor persona‑driven safety risks before deployment.

CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

Existing Artificial Olfaction (AO) primarily serves two tasks: Odor Classification (OC) and Odor Source Localization (OSL). Both tasks w.r.t. in indoor event detection scenarios are studied either using a single electronic nose (e-nose) mounted on the ceiling or mobile robot(s) equipped with e-noses. However, they are not compatible with smart home scenarios due to diverse obstacles (e.g., chairs and tables) and need for natural interaction. In this paper, we explore the feasibility and usability of a Conversational Interfaces for Artificial Olfaction (CIAO) system using Large Language Models (LLMs) in Smart Home. We made the first olfaction-oriented corpus for LLM evaluation, as well as an olfaction dataset via a self-developed olfactory sensory network. We train the dedicated models for OSL and OC using the dataset, and integrate them into a tool within the MCP (Model Context Protocol) server. Five commercial LLMs are used as MCP clients for experiments and validation. Our experimental results indicate that our CIAO system is technically feasible and applicable. Besides, we observe that ChatGPT-4o relatively outperforms in terms of both answer quality and overall LLM usability in pervasive IoT scenarios. Qwen-Plus, in contrast, appears to be a promising solution for robot-compatible applications. To the best of our knowledge, this is the first effort to bring forward conversational interfaces for AO.

Where Does This Strange Smell Come from?: Enabling Conversational Interfaces for Artificial Olfaction

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user needs. However, existing RAG systems have significant limitations, including reliance on flat data representations and inadequate contextual awareness, which can lead to fragmented answers that fail to capture complex interdependencies. To address these challenges, we propose LightRAG, a novel framework that incorporates graph structures into text indexing and retrieval processes. This innovative approach employs a dual-level retrieval system that enhances comprehensive information retrieval from both low- and high-level knowledge discovery. Additionally, the integration of graph structures with vector representations facilitates efficient retrieval of related entities and their relationships, significantly improving response times while maintaining contextual relevance. This capability is further enhanced by an incremental update algorithm that ensures the timely integration of new data, allowing the system to remain effective and responsive in rapidly changing data environments. Extensive experimental validation demonstrates considerable improvements in retrieval accuracy and efficiency compared to existing approaches. We have made our LightRAG framework open source and anonymously available at the link: https://anonymous.4open.science/r/LightRAG-2BEE.

LightRAG: Simple and Fast Retrieval-Augmented Generation

Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose **S**teering LVLMs via **S**AE **L**atent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.

Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs' multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47\%, 3.27\%, and 1.01\% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work on https://anonymous.4open.science/r/KL-GMoE-A8E4.

Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing

Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from *localization*, where attentions collapse onto a limited subset of tokens and fail to capture long-range dependencies. To address this issue, we propose **Self-Attention One-step Belief Propagation (SAOBP)**, a refinement framework that injects *multi-hop* relationships through a belief propagation process. To interpret and quantify these interactions, we introduce **Global Token Dependency (GTD)** that captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.

Downloads

Next from EMNLP 2025

AraSafe: Benchmarking Safety in Arabic LLMs

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES