China

Large language models (LLMs) can generate fluent text, raising concerns about misuse in online comments and academic writing, leading to issues like corpus pollution and copyright infringement. Existing LLM text detection methods often rely on features from the logit distribution of the input text. However, the distinction between the LLM-generated and human-written texts may rely on only a few tokens due to the short length or insufficient information in some texts, leading to minimal and hard-to-detect differences in logit distributions. To address this, we propose HALO, an LLM-based detection method that leverages external text corpora to evaluate the difference of logit distribution of input text under retrieved human-written and LLM-rewritten contexts. We find that LLM-generated texts show significantly greater consistency across varied contexts than human-written texts. HALO also complements basic detection features and can be served as a plug-and-play module to enhance existing detection methods. Extensive experiments on five public datasets with three widely-used source LLMs show that our proposed detection method achieves state-of-the-art performance in AUROC, both in cross-domain and domain-specific scenarios.

EMNLP 2025

Enhancing LLM Text Detection with Retrieved Contexts and Logits Distribution Consistency

logits distribution consistency

llm text detection

security/privacy

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Open-world knowledge graph completion (KGC) aims to infer novel facts by enriching existing graphs with external knowledge sources while maintaining semantic consistency under the open-world assumption (OWA). Generation-based KGC methods leverage the inherent strengths of large language models (LLMs) in language understanding and creative problem-solving, making them promising approaches. However, they face limitations: (1) The unreliable external knowledge from LLMs can lead to hallucinations and undermine KGC reliability. (2) The lack of an automated and rational evaluation strategy for new facts under OWA results in the exclusion of some new but correct entities. In the paper, we propose MusKGC, a novel multi-source knowledge enhancement framework based on an LLM for KGC under OWA. We induce relation templates with entity type constraints to link structured knowledge with natural language, improving the comprehension of the LLM. Next, we combine intrinsic KG facts with reliable external knowledge to guide the LLM in accurately generating missing entities with supporting evidence. Lastly, we introduce a new evaluation strategy for factuality and consistency to validate accurate inferences of new facts, including unknown entities. Extensive experiments show that our proposed model achieves SOTA performance across benchmarks, and our evaluation strategy effectively assesses new facts under OWA.

MusKGC: A Flexible Multi-source Knowledge Enhancement Framework for Open-World Knowledge Graph Completion

Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss as one model must down sample its continuous state vectors to concrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logics or abstractive thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token-wise state transition trajectory from one agent to another. Particularly, compared to the actual state value, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems. We have open-sourced all the code and data in https://anonymous.4open.science/r/StateDeltaEncoding/.

Augmenting Multi-Agent Communication with State Delta Trajectory

We propose a new approach for the author002 ship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based mod005 els. We evaluate our approach on two pop006 ular authorship attribution models and three evaluation datasets, in in-domain and out-of008 domain scenarios. We found that utilizing vari009 ous transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state012 of-the-art results. Our analysis gives further insights into how our model’s different layers get specialized in representing certain stylistic features that benefit the model when tested out of the domain.

Layered Insights: Generalizable Analysis of Human Authorial Style by Leveraging All Transformer Layers

Knowledge Editing is a technique that updates large language models (LLMs) with new information to maintain their world knowledge. This approach avoids the need to rebuild the model from scratch, thereby addressing the high costs associated with frequent retraining. Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating.

Knowledge Editing through Chain-of-Thought

We examine how multimodal large language models (MLLMs) perform logical inference grounded in visual information. We first construct a dataset of food web/chain images, along with questions that follow seven structured templates with progressively more complex reasoning involved. We show that complex reasoning about entities in the images remains challenging (even with elaborate prompts) and that visual information is underutilized.

Probing Logical Reasoning of MLLMs in Scientific Diagrams

Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large volume of tokens generated from longer videos presents significant challenges to memory efficiency and model performance. To alleviate this issue, existing works propose to compress video inputs, but usually overlooking the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, \textsc{explore-then-select}, that adaptively adjust static and dynamic information needed based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. Next, it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our proposed framework is plug-and-play that can be seamlessly integrated within diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) among various video question answering benchmarks. The code is accessible at the anonymous \href{https://anonymous.4open.science/r/Static\_Or\_Dynamic}{link}.

Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent LLM benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates large language models (LLMs) on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 17 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes—highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

DischargeSim: A Simulation Benchmark for Educational Doctor–Patient Communication at Discharge

Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multistep visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.

Can Vision-Language Models Solve Visual Math Equations?

Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG methods suffer from degraded performance and stability. Thus, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to the framework is a novel approach that refines self-attention into a mixed-attention, distinguishing shared and private semantics for a controlled internal-external knowledge integration. To effectively facilitate DSSP in RAG, we further introduce an unsupervised hallucination detection method based on cognitive uncertainty, ensuring the necessity of introducing knowledge, and an Energy Quotient (EQ) based on attention difference matrices to reduce noise in the retrieved external knowledge. Extensive experiments on benchmark datasets show that DSSP-RAG can effectively resolve conflicts and enhance the complementarity of dual-stream knowledge, leading to superior performance over strong baselines.

Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Multimodal Sentiment Analysis (MSA) is the task of understanding human emotions by analyzing a combination of different data sources, such as text, audio, and visual inputs. Although recent advances have improved emotion modeling across modalities, existing methods still struggle with two fundamental challenges: balancing global and fine-grained sentiment contributions, and over-reliance on the text modality. To address these issues, we propose DPDF-LQ (Dual-Path Dynamic Fusion with Learnable Query), an architecture that processes inputs through two complementary paths: global and local. The global path is responsible for establishing cross-modal dependencies, while the local path captures fine-grained representations. Additionally, we introduce the key module Dynamic Global Learnable Query Attention (DGLQA) in the global path, which dynamically allocates weights to each modality to capture their relevant features and learn global representations. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks demonstrate that DPDF-LQ achieves state-of-the-art performance, particularly in fine-grained sentiment prediction, by effectively combining global and local features.

Downloads

Next from EMNLP 2025

MusKGC: A Flexible Multi-source Knowledge Enhancement Framework for Open-World Knowledge Graph Completion

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES