Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a question correctly, but a seemingly trivial perturbation, which can happen in real-world settings, causes it to produce a hallucinated response with high certainty. This phenomenon, which we dub CHOKE (Certain Hallucinations Overriding Known Evidence), is particularly concerning in high-stakes domains such as medicine or law, where model certainty is often used as a proxy for reliability. We show that CHOKE examples are consistent across prompts, occur in different models and datasets, and are fundamentally distinct from other hallucinations. This difference leads existing mitigation methods to perform worse on CHOKE examples than on general hallucinations. Finally, we introduce a probing-based mitigation that outperforms existing methods on CHOKE hallucinations. These findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety.

EMNLP 2025

Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer

certainty

hallucination

mitigation

knowledge

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

In Semantic Change Detection (SCD), it is a common problem to obtain embeddings that are both interpretable and high-performing. However, improving interpretability often leads to a loss in the SCD performance, and vice versa. To address this problem, we propose SCDTour, a method that orders and merges interpretable axes to alleviate the performance degradation of SCD. SCDTour considers both (a) semantic similarity between axes in the embedding space, as well as (b) the degree to which each axis contributes to semantic change. Experimental results show that SCDTour preserves performance in semantic change detection while maintaining high interpretability. Moreover, agglomerating the sorted axes produces a more refined set of word senses, which achieves comparable or improved performance against the original full-dimensional embeddings in the SCD task. These findings demonstrate that SCDTour effectively balances interpretability and SCD performance, enabling meaningful interpretation of semantic shifts through a small number of refined axes.

SCDTour: Embedding Axis Ordering and Merging for Interpretable Semantic Change Detection

Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. Despite this efficiency, existing methods are limited: edits may fail to inject knowledge (UnderEdit) or unintentionally disrupt unrelated neighboring knowledge (OverEdit). To address these challenges, we propose two complementary methods: **iterative model editing**, which applies successive edits to mitigate UnderEdit, and **neighbor-assisted model editing**, which incorporates neighboring knowledge during editing to reduce OverEdit. Our extensive experiments show that these techniques improve editing performance across multiple LLMs, algorithms, and benchmarks, reducing UnderEdit by up to 38 percentage points and OverEdit by up to 6, while remaining broadly applicable to any locate-and-edit method.

Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing

Analyzing Socially Unacceptable Discourse (SUD) online is a critical challenge for regulators and platforms amidst growing concerns over harmful content. While Pre-trained Masked Language Models (PMLMs) have proven effective for many NLP tasks, their performance often degrades in multi-label SUD classification due to overlapping linguistic cues across categories. In this work, we propose an artifact-guided pre-training strategy that injects statistically salient linguistic features, referred to as artifacts, into the masked language modelling objective. By leveraging context-sensitive tokens, we guide an importance-weighted masking scheme during pre-training to enhance generalization across discourse types. We further use these artifact signals to inform a lightweight dataset curation procedure that highlights noisy or ambiguous instances. This supports targeted relabeling and filtering, enabling more explainable and consistent annotation with minimal changes to the original data. Our approach provides consistent improvements in 10 datasets extensively used in SUD classification benchmarks. *Disclaimer: This article contains some extracts of unacceptable and upsetting language.*

[MASK]ED - Language Modeling for Explainable Classification and Disentangling of Socially Unacceptable Discourse.

As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are negatively biased or inaccurate thought patterns that adversely affect the way an individual perceives themselves and the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of modelling approaches, datasets, and evaluation strategies. We propose a canonical CD taxonomy, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area.

A Survey of Cognitive Distortion Detection and Classification in NLP

As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. The reliability of LLM judges in complex tasks, where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical, remains understudied. In this paper, we construct ComplexEval Bench, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We investigate and validate 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal that (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; and (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation

Alzheimer's Disease (AD), the 7th leading cause of death globally, demands scalable methods for early detection. While speech-based diagnostics offer promise, existing approaches struggle with temporal-spatial (T-S) challenges in capturing subtle linguistic shifts across different disease stages (temporal) and in adapting to cross-linguistic variability (spatial). This study introduces an LLM-driven T-S fusion, a novel framework that synergizes multilingual large language models (LLMs), contrastive learning, and interpretable marker discovery to revolutionize LOAD (Late-Onset AD) detection. Our key innovations include: (1) T-S Data Imputation: Leveraging LLMs to generate synthetic speech transcripts across different LOAD stages (NC, eMCI, lMCI, LOAD) and languages (Chinese/English/Spanish), addressing data scarcity while preserving clinical relevance (expert validation: 86% agreement with LLM-generated labels). (2) T-S Transformer with Contrastive Learning: A multilingual model that disentangles stage-specific (temporal) and language-specific (spatial) patterns, achieving a notable improvement of 10.9–24.7% in F1-score over existing baselines. (3) Cross-Linguistic Marker Discovery: Identifying language-agnostic markers and language-specific patterns to enhance interpretability for clinical adoption. By unifying temporal LOAD stages and spatial diversity, our framework achieves state-of-the-art performance in early LOAD detection while enabling cross-linguistic diagnostics. This study bridges NLP and clinical neuroscience, demonstrating LLMs' potential to amplify limited biomedical data and advance equitable healthcare AI.

An LLM-based Temporal-spatial Data Generation and Fusion Approach for Early Detection of Late Onset Alzheimer's Disease (LOAD) Stagings Especially in Chinese and English-speaking Populations

Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.

BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, particularly in tasks where subjectivity plays a crucial role. In this work, we explore methods for identifying subjectivity in recognizing the human values that motivate arguments. We evaluate two main approaches: inferring subjectivity through value prediction versus directly identifying subjectivity. Our experiments show that direct subjectivity identification significantly improves model performance in flagging subjective arguments. Furthermore, combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Our proposed methods can help identify arguments that individuals may interpret differently, fostering a more nuanced annotation process.

Will Annotators Disagree? Identifying Subjectivity in Value-Laden Arguments

Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation procedure is effective at improving model performance on simple, conversational text.

Adapting Large Language Models for Character-based Augmentative and Alternative Communication

Cognitive science offers rich theories of learning and communication, yet these are often difficult to operationalize at scale. We demonstrate how natural language processing can bridge this gap by applying psycholinguistic theories of discourse to real-world educational data. We investigate linguistic alignment (the convergence of conversational partners’ word choice, grammar, and meaning) as a measure of interactive alignment. Using a longitudinal dataset of real-world tutoring interactions and associated student test scores, we examine (1) the extent of alignment, (2) role-based patterns among tutors and students, and (3) the relationship between alignment and learning outcomes. We find that both tutors and students exhibit lexical, syntactic, and semantic alignment, with tutors aligning more strongly overall. Crucially, lexical alignment predicts student learning gains. As a lightweight, interpretable metric, linguistic alignment offers practical applications in intelligent tutoring systems, educator dashboards, and tutor training.
