EMNLP 2025

November 06, 2025

Suzhou, China

Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually have harmful intent, creating a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that the response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance (providing general information without actionable details) emerges as the optimal strategy, reducing negative user perceptions by over 50% relative to flat-out refusals. Complementing this, we analyze the response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and that reward models currently undervalue it. This work shows that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
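As a minimal sketch of the kind of reward-model evaluation the abstract mentions, the snippet below scores one query paired with three response styles (full compliance, partial compliance, flat refusal) using an off-the-shelf sequence-classification reward model. The checkpoint name, query, and response texts are illustrative placeholders, not the paper's actual models or stimuli.

```python
# Illustrative sketch: score three refusal strategies for the same query with
# a generic reward model. Checkpoint and texts are placeholders, not the
# paper's experimental setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

REWARD_MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(REWARD_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(REWARD_MODEL)
model.eval()

query = "How do locks get picked?"  # hypothetical dual-use query
responses = {
    "full_compliance": "Here is a step-by-step guide ...",
    "partial_compliance": (
        "Locks are opened by manipulating their pin tumblers, but I won't "
        "give actionable instructions. I can explain how locks are designed "
        "to resist tampering instead."
    ),
    "flat_refusal": "I can't help with that request.",
}

with torch.no_grad():
    for strategy, response in responses.items():
        inputs = tokenizer(query, response, return_tensors="pt", truncation=True)
        score = model(**inputs).logits[0].item()  # scalar reward score
        print(f"{strategy:>18}: {score:+.3f}")
```

Comparing the scores across strategies shows, in miniature, how one can test whether a reward model prefers partial compliance over a flat refusal.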

Downloads

  • Slides
  • Paper
  • Transcript English (automatic)
