China

With the widespread application of large language models (LLMs) across various domains, techniques for enhancing their security have progressed rapidly. In this paper, we reveal that although existing defense methods can improve the robustness of LLMs against jailbreaks, they compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of LLM mechanism interpretability, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against both single-turn and multi-turn jailbreak attacks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training.

EMNLP 2025

X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability

over-refusal mitigation

defense against jailbreak attacks

large language model

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large language models (LLMs) have become essential tools for digital task assistance. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on detecting pretraining data in LLMs have primarily focused on sentence- or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model's predicted tokens. However, these methods often exhibit poor accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose Tag&Tab, a novel approach for detecting data used in LLM pretraining. Our method leverages advanced natural language processing (NLP) techniques to tag keywords in the input text—a process we term Tagging. Then, the LLM is used to obtain probabilities for these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on four benchmark datasets (BookMIA, MIMIR, PatentMIA, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in AUC scores ranging from 5.3% to 17.6% over state-of-the-art methods. Tag&Tab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.

Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

Instruction tuning—supervised fine-tuning using instruction-response pairs—is a key step in making pre-trained large language models (LLMs) instructable. Meanwhile, LLMs perform multitask learning during their pre-training, acquiring extensive knowledge and capabilities. We hypothesize that the pre-training stage can enable them to develop the ability to comprehend and address instructions. To verify this, we propose Response Tuning (RT), which removes the instruction and its corresponding mapping to the response from instruction tuning. Instead, it focuses solely on establishing the response distribution. Our experiments demonstrate that RT models, trained only on responses, can effectively respond to a wide range of instructions and exhibit helpfulness approaching that of their instruction-tuned counterparts. In addition, we observe that the models can recognize and reject unsafe queries after learning the refusal conditions from training responses. Furthermore, we demonstrate that these observations also hold in an in-context learning setting. These findings support our hypothesis, highlighting the extensive inherent capabilities of pre-trained LLMs.

Revealing the Inherent Instructability of Pre-Trained Language Models

Personality assessment is essential for developing user-centered systems, playing a critical role across domains including hiring, education, and personalized system design. With the integration of conversational AI systems into daily life, automatically assessing human personality through natural language interaction has gradually gained more attention. However, existing personality assessment datasets based on natural language generally lack consideration of interactivity. Therefore, we propose Personality-1260, a Chinese dataset containing 1260 interaction rounds between humans and agents with different personalities, aiming to support research on personality assessment. Based on this dataset, we designed experiments to explore the effects of different interaction rounds and agent personalities on personality assessment. Results show that fewer interaction rounds perform better in most cases, and agents with different personalities stimulate different expressions of users' personalities. These findings provide guidance for the design of interactive personality assessment systems.

Rethinking Personality Assessment from Human-Agent Dialogues: Fewer Rounds May Be Better Than More

Persistent societal biases like misogyny express themselves more often implicitly than through openly hostile language. However, previous misogyny studies have focused primarily on explicit language, overlooking these more subtle forms. We bridge this gap by examining implicit misogynistic expressions in English and Italian. First, we develop a taxonomy of social dynamics, i.e., the underlying communicative intent behind misogynistic statements in social media data. Then, we test the ability of nine LLMs to identify the social dynamics as a multi-label classification and text span selection: first LLMs must choose social dynamics given a prefixed list, then they have to explicitly identify the text spans that triggered their decisions. We also investigate the extent of using different learning settings: zero and few-shot, and prescriptive. Our analysis suggests that LLMs struggle to follow instructions and reason in all settings, mostly relying on semantic associations, recasting claims of emergent abilities.

The "r" in "woman" stands for rights. Auditing LLMs in Uncovering Social Dynamics in Implicit Misogyny

Fact verification on knowledge graphs (KGs) uses the structured representation of entities and relations as evidence for validating claims. Previous methods for KG-based fact verification predominantly use natural language inference (NLI) models to predict entailment between claims and KG triples, based on implicit reasoning. We propose Programmatic Graph Reasoning (PGR), a novel framework that integrates large language models (LLMs) for fact verification on KGs. PGR explicitly encodes the reasoning process as a graph reasoning program composed of predefined functions to verify claims step by step. These functions are executed sequentially for graph reasoning and final result prediction. By making the graph reasoning process explicit, PGR ensures more precise and transparent reasoning steps compared to implicit methods. Experimental results on the FactKG dataset demonstrate that PGR achieves state-of-the-art performance with 86.82% accuracy, outperforming all the baseline models. Further analysis confirms the interpretability and effectiveness of our method in handling complex graph reasoning.

Fact Verification on Knowledge Graph via Programmatic Graph Reasoning

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that enhances the simplicity and training stability of reinforcement learning through reward function reparameterization from PPO. Recently, SimPO (Simple Preference Optimization) and CPO (Contrastive Preference Optimization) have proposed reference-free preference optimization methods to simplify DPO's training process. We observe that these reference-free methods exhibit higher training efficiency but are prone to overoptimization, leading to performance degradation. To address these issues, we propose Self Preference Optimization (SPO). SPO employs the SiLU function to replace the conventional logsigmoid loss function. The SiLU function attains its minimum at a finite value, preventing the model from excessively amplifying the chosen-rejected sample probability ratio and thereby mitigating overoptimization problem. We theoretically demonstrate that the SPO loss is an upper bound of the DPO loss, implying that optimizing the SPO objective implicitly optimizes the DPO objective. We evaluate SPO's effectiveness across multiple benchmarks including AlpacaEval 2 and MT-Bench. Experimental results show that SPO achieves a 7% improvement over SimPO in length-controlled win rate on AlpacaEval 2, while demonstrating superior performance on MT-Bench.

SPO: Self Preference Optimization with Self Regularization

Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we first present a finding that the difference in output distributions between jailbreak and benign prompts can be employed for detecting jailbreak prompts. Based on this finding, we propose a Free Jailbreak Detection (FJD) method which prepends an affirmative instruction to the input and scales the logits by temperature to distinguish between jailbreak and benign prompts through the confidence of the first token. Furthermore, we enhance the detection performance of FJD through the integration of virtual instruction learning. Extensive experiments on aligned LLMs show that our FJD can effectively detect jailbreak prompts with almost no additional computational costs during LLM inference.

LLM Jailbreak Detection for (Almost) Free!

Large language models(LLMs) based Agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We proposed the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmented issues in teaching process modeling and the limitations of agents performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy, employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiment confirms that AAS can effectively simulate intricate educational dynamics and is effective in fostering advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.

Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

With the widespread applications of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we always expect LLMs to be honest, positive, harmless, etc. And LLMs appear to be capable of generating the desired outputs after the alignment tuning process, such as the preference tuning via reinforcement learning from human feedback (RLHF). However, it also raises a question about **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond the generation of positive outputs?** In this work, we start by investigating this question from the token distribution perspective. Our findings reveal that compared to the unaligned versions, LLMs after alignment exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction of positives and negatives after alignment. Meanwhile, it also motivates us to achieve alignment by directly constructing such value distinction, thus alleviating the excessive reliance on computational resources required by training-time alignment. Specifically, we propose a representation editing method that intervenes the last hidden representation by amplifying the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment, but also requires fewer computational resources compared to training-time alignment methods

Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning

Large Language Models have achieved significant advancements in various natural language processing tasks. However, they are susceptible to generating hallucinations-fabricated or inaccurate statements presented as factual information-which can undermine their reliability in high-stakes applications. To address this issue, we propose a new inference-stage hallucination mitigation method, Regularized Contrastive Decoding (RCD), to exploit hard negative samples for improving the robustness of contrastive decoding. Additionally, we design a new adversarial-aware regularization term to finetune hallucination models to learn more challenging and diverse hallucination patterns from available data with the guidance of adversarial perturbations. This enhances the contrastive decoding process, enabling more effective identification and filtering of erroneous content. We conduct experiments on four public hallucination benchmarks. Experimental results show our method achieves better hallucination mitigation performance consistently, proving the effectiveness and superiority of RCD for hallucination mitigation.

Downloads

Next from EMNLP 2025

Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES