China

Quantifying uncertainty in black-box LLMs is vital for reliable responses and scalable oversight. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the target query, can be misleading: an LLM may confidently provide an incorrect answer to a target query, yet give a confident and accurate answer to that same target query when answering a knowledge-preserving perturbation of the query. We systematically analyze the model behaviors and demonstrate that this discrepancy stems from suboptimal retrieval of parametric knowledge, often due to contextual biases that prevent consistent access to stored knowledge. We then introduce DiverseAgentEntropy, a novel, theoretically-grounded method employing multi-agent interaction across diverse query variations for uncertainty estimation of black-box LLMs. This approach more accurately assesses an LLM&#39;s true uncertainty and improves hallucination detection, outperforming existing self-consistency based techniques.

EMNLP 2025

Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty

multi-agent interaction

hallucination detection

uncertainty quantification

Quantifying uncertainty in black-box LLMs is vital for reliable responses and scalable oversight. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the target query, can be misleading: an LLM may confidently provide an incorrect answer to a target query, yet give a confident and accurate answer to that same target query when answering a knowledge-preserving perturbation of the query. We systematically analyze the model behaviors and demonstrate that this discrepancy stems from suboptimal retrieval of parametric knowledge, often due to contextual biases that prevent consistent access to stored knowledge. We then introduce DiverseAgentEntropy, a novel, theoretically-grounded method employing multi-agent interaction across diverse query variations for uncertainty estimation of black-box LLMs. This approach more accurately assesses an LLM's true uncertainty and improves hallucination detection, outperforming existing self-consistency based techniques.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Open-ended Event Forecasting (OEEF) is vital in various real-world applications. However, it faces challenges, including limited availability of datasets that enhance LLM's predictive capabilities and crude methods of organizing forecast-related information. In this work, we construct a large-scale dataset NewsForest that contains 12,406 prediction chains reflecting the drivers of event development. To effectively extract information from the prediction background, we propose a prediction method, ForestCast. ForestCast organizes all relevant news into a story tree and predicts each branch based on the story tree. ForestCast has five main steps: (1) collecting and cleaning news, (2) clustering news into event nodes, (3) constructing the news story tree, (4) mining the semantic structure of the news story tree, (5) predicting the next node and evaluating the quality of the predictions. Experiments demonstrate that the NewsForest dataset enhances the model’s ability to predict these structures. The ForestCast method improves the accuracy and quality of predictions.

ForestCast: Open-Ended Event Forecasting with Semantic News Forest

To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called XDeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

The spread of fake news on social media poses a serious threat to public trust and societal stability. While propagation-based methods improve fake news detection by modeling how information spreads, they often suffer from incomplete propagation data. Recent work leverages large language models (LLMs) to generate synthetic propagation, but typically overlooks the structural patterns of real-world discussions. We propose a novel structure-aware synthetic propagation enhanced detection (StruSP) framework to fully capture structural dynamics from real propagation and enable LLMs to generate realistic and structurally consistent propagation for better detection. Different from existing methods that rely on role-playing simulations for propagation generation, our StruSP explicitly aligns synthetic propagation with real-world propagation in both semantic and structural dimensions. Besides, we also design a new bidirectional evolutionary propagation (BEP) learning strategy to better align LLMs with structural patterns of propagation in the real world via structure-aware hybrid sampling and masked propagation modeling objective. Experiments on three public datasets demonstrate that StruSP significantly improves fake news detection performance in different settings. Further analysis indicates that BEP enables the LLM to generate more realistic and diverse propagation.

Structure-aware Propagation Generation with Large Language Models for Fake News Detection

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

With the advancement of large language models (LLMs), more concerns about their controllability have been raised in recent research. In this paper, we argue for the importance of Knowledge-Constrained Responsiveness (KCR), ensuring that LLMs comply with human-defined constraints. However, KCR is an implicit and unobservable capability of LLMs, functioning as a black box that currently cannot be quantitatively assessed. To address this, we first introduce the definition of "permitted boundary" and define the "boundary bias" to depict KCR. We propose six metrics to quantify the boundary bias of LLMs and subsequently assess the KCR. Furthermore, we establish a benchmark with two new datasets, KCR-SimpleQA and KCR-WebNLG, to evaluate the performance of LLMs. Our extensive experiments show that several tested LLMs still struggle to varying degrees when adhering to constraints, especially without the corresponding knowledge.

Permitted Knowledge Boundary: Evaluating the Knowledge-Constrained Responsiveness of Large Language Models

This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the middle layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.

How a Bilingual LM Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders

In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce CoT to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Safety alignment is critical in pre-trained large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLM, the current safety alignment of VLMs is often achieved with post-hoc safety fine-tuning. However, these methods are less effective to white-box attacks. To address this, we propose textitAdversary-aware DPO (ADPO), a novel training framework that explicitly considers adversary. textitAdversary-aware DPO (ADPO) integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. textitADPO introduces two key components: (1) an adversarial-trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, textitADPO ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that textitADPO outperforms baselines in terms of both safety alignment and general utility of VLMs.

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Link prediction in knowledge graphs (KGs) requires integrating structural information and semantic context to infer missing entities. While large language models (LLMs) offer strong generative reasoning capabilities, they often struggle with *structural sparsity*, *semantic ambiguity*, and limited exploitation of structural signals, especially under incomplete or zero-shot settings. To address these challenges, we propose **SLiNT** (**S**tructure-aware **L**anguage model with **I**njection and co**N**trastive **T**raining), a modular framework that injects KG-derived structural context into frozen LLMs for robust link prediction. Specifically, **Structure-Guided Neighborhood Enhancement (SGNE)** retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; **Dynamic Hard Contrastive Learning (DHCL)** introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and **Gradient-Decoupled Dual Injection (GDDI)** performs token-level structure-aware intervention without altering the LLM backbone. Experiments on WN18RR and FB15k-237 show that SLiNT outperforms both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.

SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

Downloads

Next from EMNLP 2025

ForestCast: Open-Ended Event Forecasting with Semantic News Forest

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES