
The code generation capabilities of Large Language Models (LLMs) have advanced applications such as tool invocation and problem-solving. However, improving performance on code-related tasks remains challenging because only a limited amount of training data can be verified with accurate test cases. While Direct Preference Optimization (DPO) has shown promise, existing methods for generating test cases still face limitations. In this paper, we propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce Abstract Syntax Tree (AST)-based splitting and a curriculum training method to enhance DPO training. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval(+), MBPP(+), APPS, LiveCodeBench, and BigCodeBench. Code and data are available at https://anonymous.4open.science/r/StructureCoder-A3B5.
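The block-splitting idea can be illustrated with a minimal sketch built on Python's standard `ast` module. It assumes the snippet is a single top-level function and treats each top-level statement of the body as one block. The helper `make_fim_examples` and the sample function are hypothetical, and the paper's actual block granularity, DPO pair construction, and curriculum schedule are not reproduced here. Each (prefix, middle, suffix) triple could serve as a fill-in-the-middle prompt whose candidate middles are checked against the same test cases.

```python
import ast


def make_fim_examples(source: str) -> list[tuple[str, str, str]]:
    """Mask one top-level statement of the first function at a time.

    Returns (prefix, middle, suffix) triples whose concatenation is `source`.
    Compound statements (if/for/while/...) are kept whole, so every masked
    middle is a syntactically complete block.
    """
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)
    # Hypothetical assumption: the snippet contains one top-level function.
    func = next(n for n in tree.body if isinstance(n, ast.FunctionDef))
    examples = []
    for stmt in func.body:
        start, end = stmt.lineno - 1, stmt.end_lineno  # 0-based start, exclusive end
        examples.append((
            "".join(lines[:start]),
            "".join(lines[start:end]),
            "".join(lines[end:]),
        ))
    return examples


SOURCE = '''def add_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
'''

for prefix, middle, suffix in make_fim_examples(SOURCE):
    print(repr(middle))
```

Because compound statements such as the `for` loop are masked whole, every middle is a syntactically complete unit; a finer-grained variant could recurse into nested bodies.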

EMNLP 2025

Alignment with Fill-In-the-Middle for Enhancing Code Generation

fill in the middle

direct preference optimization

code generation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost and with a risk of introducing misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remains inefficient and impractical. In this study, we introduce lightweight, LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, as well as their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing both QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.
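As a rough illustration of an LLM-independent retrieval decision, the sketch below scores a question from cheap external features and retrieves only when the score crosses a threshold. The features (`entity_popularity`, `question_length`, `asks_for_number`), the weights, and the threshold are all hypothetical placeholders, not the 27 features studied in the paper. The intuition encoded by the negative popularity weight (questions about popular entities are more likely answerable from the model's parametric knowledge) is likewise an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class QuestionFeatures:
    # Hypothetical LLM-independent features; not the paper's 27 features.
    entity_popularity: float  # e.g. log page-view count of the main entity
    question_length: int      # number of whitespace-separated tokens
    asks_for_number: bool     # question asks for a date or quantity


# Hypothetical weights; in practice they would be fit on labelled QA data.
WEIGHTS = {"bias": 1.0, "entity_popularity": -0.8,
           "question_length": 0.05, "asks_for_number": 0.7}


def retrieval_probability(f: QuestionFeatures) -> float:
    """Logistic score: estimated probability that retrieval is needed."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["entity_popularity"] * f.entity_popularity
         + WEIGHTS["question_length"] * f.question_length
         + WEIGHTS["asks_for_number"] * float(f.asks_for_number))
    return 1.0 / (1.0 + math.exp(-z))


def should_retrieve(f: QuestionFeatures, threshold: float = 0.5) -> bool:
    # No LLM call is needed to make this decision.
    return retrieval_probability(f) >= threshold


popular = QuestionFeatures(entity_popularity=5.0, question_length=8, asks_for_number=False)
rare = QuestionFeatures(entity_popularity=0.5, question_length=12, asks_for_number=True)
print(should_retrieve(popular), should_retrieve(rare))  # False True
```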

LLM-Independent Adaptive RAG: Let the Question Speak for Itself

Uncertain knowledge graph embedding (UnKGE) methods learn vector representations that capture both structural and uncertainty information to predict scores of unseen triples. However, existing methods produce only point estimates without quantifying predictive uncertainty, which limits their reliability in high-stakes applications where understanding confidence in predictions is crucial. To address this limitation, we propose UnKGCP, a framework that generates prediction intervals guaranteed to contain the true score with a user-specified level of confidence. The length of the intervals reflects the model’s predictive uncertainty. UnKGCP builds on the conformal prediction framework but introduces a novel nonconformity measure tailored to UnKGE methods and an efficient procedure for interval construction. We provide theoretical guarantees for the intervals and verify these guarantees empirically. Extensive experiments on standard UKG benchmarks across diverse UnKGE methods further demonstrate that the intervals are sharp and effectively capture predictive uncertainty.
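UnKGCP builds on conformal prediction, so a standard split-conformal interval is a useful reference point. The sketch below uses the generic absolute-residual nonconformity score rather than the paper's UnKGE-specific measure, and it clips intervals to [0, 1] on the assumption that triple confidence scores lie in that range. Under exchangeability of calibration and test points, intervals built this way contain the true score with probability at least 1 - α.

```python
import math


def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction interval for one test prediction.

    Uses the standard absolute-residual nonconformity score |y - y_hat|,
    not the UnKGE-specific measure proposed for UnKGCP.
    """
    n = len(cal_pred)
    scores = sorted(abs(y - p) for p, y in zip(cal_pred, cal_true))
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else math.inf  # too few points -> trivial interval
    # Assumption: confidence scores live in [0, 1], so clip the interval to it.
    return max(0.0, test_pred - q), min(1.0, test_pred + q)


cal_pred = [0.5] * 9
cal_true = [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59]
print(conformal_interval(cal_pred, cal_true, test_pred=0.6, alpha=0.1))  # about (0.51, 0.69)
```

A smaller α demands a higher-ranked calibration residual and therefore a wider interval; when the calibration set is too small for the requested α, the interval degenerates to the whole score range.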

Certainty in Uncertainty: Reasoning over Uncertain Knowledge Graphs with Statistical Guarantees

Although autoregressive models excel in natural language processing, they often struggle to generate diverse text and offer limited controllability. Non-autoregressive methods could be an alternative, but they often produce degenerate outputs and exhibit shortcomings in conditional generation. To address these challenges, we propose Diffusion-EAGS, a novel framework that integrates conditional masked language models into diffusion language models through the theoretical lens of a conditional Markov Random Field. Within this framework, we propose entropy-adaptive Gibbs sampling and entropy-based noise scheduling to counterbalance the shortcomings of each model. Experimental results show that Diffusion-EAGS outperforms baselines and achieves the best quality-diversity trade-off, demonstrating its effectiveness in non-autoregressive text generation.
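To illustrate what entropy-guided decoding in a masked model can look like, the sketch below repeatedly predicts distributions for all positions and fills the lowest-entropy (most confident) masked positions first. This is a generic entropy-ordered unmasking loop written under our own assumptions, not the paper's entropy-adaptive Gibbs sampler or noise schedule. `toy_predict` is a hypothetical stand-in for a conditional masked language model, and argmax replaces sampling so the output is deterministic.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_ordered_unmask(tokens, predict, vocab, per_step=1, mask="[MASK]"):
    """Iteratively fill masked positions, most confident (lowest entropy) first.

    `predict(tokens)` must return one probability list over `vocab` per position.
    Argmax decoding is used for determinism; a Gibbs sampler would instead
    sample each update from the conditional distribution.
    """
    tokens = list(tokens)
    while mask in tokens:
        dists = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t == mask]
        masked.sort(key=lambda i: entropy(dists[i]))
        for i in masked[:per_step]:
            best = max(range(len(vocab)), key=lambda j: dists[i][j])
            tokens[i] = vocab[best]
    return tokens


VOCAB = ["a", "b", "c"]
calls = []


def toy_predict(tokens):
    # Hypothetical fixed model: position 0 is confident, position 1 is not.
    calls.append(list(tokens))
    return [[0.9, 0.05, 0.05], [0.2, 0.5, 0.3]]


print(entropy_ordered_unmask(["[MASK]", "[MASK]"], toy_predict, VOCAB))  # ['a', 'b']
```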

Conditional [MASK] Discrete Diffusion Language Model

This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core techniques. First, we design a dynamic expert scheduling mechanism that allocates computational resources during the diffusion process according to text complexity, enabling more efficient handling of generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns to the input length, reducing computational complexity from O(n²) to O(n) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that, combined with DPM-Solver++, reduces the number of diffusion steps and significantly improves generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate that DrDiff outperforms existing SOTA methods.
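HSA's exact design is not given in the abstract, but the basic reason sparse attention can scale linearly is easy to show with a fixed local window, a standard sparse pattern swapped in here purely for illustration. In the sketch below each query attends only to keys within `window` positions, so it scores at most 2 * window + 1 keys regardless of the sequence length n. The toy `q`, `k`, `v` values are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sliding_window_attention(q, k, v, window):
    """Scaled dot-product attention restricted to |i - j| <= window.

    Each query scores at most 2 * window + 1 keys, so the cost is
    O(n * window * d) instead of O(n^2 * d) for full attention.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = range(lo, hi)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, keys)) for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(sliding_window_attention(q, k, v, window=1))
```

With `window` fixed, total work grows as O(n * window * d); HSA additionally adapts its pattern to the input length, which this sketch does not attempt.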

DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Current VLM-based VQA methods often process entire images, producing an excessive number of visual tokens, many of which carry information irrelevant to the posed question. These redundant tokens drastically increase memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework that compresses visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct pruning strategies are introduced: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance. Our code and datasets will be made available.
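The region-guided pruning step can be pictured as keeping only the visual tokens whose image patches lie inside the localized region. The sketch below assumes a hypothetical bounding box `(x0, y0, x1, y1)` produced by the localization model and row-major flattening of patch tokens. It performs hard pruning, a simplification of both PLC (which compresses regions at varying ratios) and ILP (which prunes inside early LLM layers).

```python
def region_token_indices(grid_h, grid_w, box, image_h, image_w):
    """Indices of patch tokens whose centers fall inside box = (x0, y0, x1, y1).

    Assumes patch tokens are flattened row-major over a grid_h x grid_w grid
    that evenly tiles an image of size image_h x image_w (pixels).
    """
    x0, y0, x1, y1 = box
    patch_h, patch_w = image_h / grid_h, image_w / grid_w
    keep = []
    for r in range(grid_h):
        for c in range(grid_w):
            cy, cx = (r + 0.5) * patch_h, (c + 0.5) * patch_w
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                keep.append(r * grid_w + c)
    return keep


def prune_visual_tokens(tokens, keep):
    """Drop every visual token outside the contextual region."""
    return [tokens[i] for i in keep]


keep = region_token_indices(4, 4, box=(0, 0, 112, 112), image_h=224, image_w=224)
print(keep)  # [0, 1, 4, 5]
```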

CROP: Contextual Region-Oriented Visual Token Pruning

The DIFF Transformer mitigates interference from irrelevant contexts by introducing a differential attention mechanism, thereby enhancing focus on critical tokens. However, this architecture suffers from two major limitations: first, its use of two independent attention matrices leads to numerical instability, and second, it lacks global context modeling, which is essential for identifying globally significant tokens. To address these challenges, we propose the DINT Transformer, which extends the DIFF Transformer by incorporating an integral mechanism. By computing global importance scores and integrating them into the attention matrix, the DINT Transformer not only improves overall numerical stability but also significantly enhances its ability to capture global dependencies. Experimental results demonstrate that the DINT Transformer achieves superior accuracy and robustness across various practical applications, including long-context language modeling and key information retrieval. These advancements establish the DINT Transformer as a highly effective and promising architecture.
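For reference, the DIFF Transformer's differential attention weights are softmax(Q1 K1^T / sqrt(d)) - λ * softmax(Q2 K2^T / sqrt(d)). The sketch below computes them for single-head toy inputs and then adds one hypothetical integral term: the column mean of the first attention map, read as each key's global importance and scaled by λ. This is our illustrative reading of "integrating global importance scores into the attention matrix", not DINT's published formula. Here λ is a fixed scalar rather than a learned parameter, and the value projection and normalization are omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Row-wise softmax(Q K^T / sqrt(d))."""
    d = len(q[0])
    return [softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]) for qi in q]


def diff_attention(q1, k1, q2, k2, lam):
    """DIFF-style weights: softmax(Q1 K1^T/sqrt(d)) - lam * softmax(Q2 K2^T/sqrt(d))."""
    a1, a2 = attention_weights(q1, k1), attention_weights(q2, k2)
    return [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


def dint_style_attention(q1, k1, q2, k2, lam):
    """Hypothetical integral term: add lam * (mean attention each key receives)."""
    a1 = attention_weights(q1, k1)
    diff = diff_attention(q1, k1, q2, k2, lam)
    n = len(a1)
    # Global importance of key j = column mean of the first attention map.
    g = [sum(row[j] for row in a1) / n for j in range(len(a1[0]))]
    # Rows of `diff` sum to 1 - lam and g sums to 1, so every row now sums to 1.
    return [[x + lam * gj for x, gj in zip(row, g)] for row in diff]


q1 = k1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q2 = k2 = [[0.5, 0.2], [0.1, 0.3], [0.0, 1.0]]
for row in dint_style_attention(q1, k1, q2, k2, lam=0.5):
    print([round(x, 3) for x in row])
```

The added term restores each row sum to 1 (differential rows sum to 1 - λ), which is one way such a term could improve numerical behavior.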

DINT Transformer

Large language models (LLMs) can generate fluent text, raising concerns about misuse in online comments and academic writing that can lead to issues such as corpus pollution and copyright infringement. Existing LLM text detection methods often rely on features from the logit distribution of the input text. However, because some texts are short or carry little information, the distinction between LLM-generated and human-written text may rest on only a few tokens, leading to minimal, hard-to-detect differences in logit distributions. To address this, we propose HALO, an LLM-based detection method that leverages external text corpora to evaluate differences in the logit distribution of the input text under retrieved human-written and LLM-rewritten contexts. We find that LLM-generated texts show significantly greater consistency across varied contexts than human-written texts. HALO also complements basic detection features and can serve as a plug-and-play module to enhance existing detection methods. Extensive experiments on five public datasets with three widely used source LLMs show that our proposed detection method achieves state-of-the-art performance in AUROC in both cross-domain and domain-specific scenarios.
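The consistency finding can be turned into a toy detector: score the same text under several retrieved contexts and measure how much its mean token log-probability moves. The sketch below assumes those per-context log-probabilities are already computed (obtaining them requires an LLM and a retriever, not shown). It reduces HALO's logit-distribution comparison to a single scalar per context, and the threshold and example values are hypothetical.

```python
import statistics


def consistency_score(per_context_logprobs):
    """Negative spread of the text's mean token log-prob across contexts.

    per_context_logprobs[c] holds the token log-probs of the SAME input text
    scored under retrieved context c. Higher score = more consistent.
    """
    means = [statistics.fmean(lps) for lps in per_context_logprobs]
    return -statistics.pstdev(means)


def classify(per_context_logprobs, threshold=-0.1):
    # Hypothetical threshold; it would be tuned on held-out data.
    return "llm" if consistency_score(per_context_logprobs) >= threshold else "human"


llm_like = [[-1.0, -1.2], [-1.1, -1.1], [-1.05, -1.15]]
human_like = [[-1.0, -1.0], [-3.0, -3.0], [-2.0, -2.0]]
print(classify(llm_like), classify(human_like))  # llm human
```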

Enhancing LLM Text Detection with Retrieved Contexts and Logits Distribution Consistency

Open-world knowledge graph completion (KGC) aims to infer novel facts by enriching existing graphs with external knowledge sources while maintaining semantic consistency under the open-world assumption (OWA). Generation-based KGC methods leverage the inherent strengths of large language models (LLMs) in language understanding and creative problem-solving, making them promising approaches. However, they face two limitations: (1) unreliable external knowledge from LLMs can lead to hallucinations and undermine KGC reliability; (2) the lack of an automated and rational evaluation strategy for new facts under OWA results in the exclusion of some new but correct entities. In this paper, we propose MusKGC, a novel multi-source knowledge enhancement framework based on an LLM for KGC under OWA. We induce relation templates with entity type constraints to link structured knowledge with natural language, improving the LLM's comprehension. Next, we combine intrinsic KG facts with reliable external knowledge to guide the LLM in accurately generating missing entities with supporting evidence. Lastly, we introduce a new evaluation strategy for factuality and consistency to verify the accuracy of inferred new facts, including those involving unknown entities. Extensive experiments show that our proposed model achieves SOTA performance across benchmarks and that our evaluation strategy effectively assesses new facts under OWA.
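Relation templates with entity type constraints can be sketched as a small lookup that verbalizes an incomplete triple and filters candidates by type. The relation id `place_of_birth`, its wording, and the helper names are hypothetical. MusKGC induces its templates and also grounds generation in KG facts and external evidence, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTemplate:
    head_type: str
    tail_type: str
    pattern: str  # natural-language pattern with {head} and {tail} slots


# Hypothetical template; relation ids and wording are illustrative only.
TEMPLATES = {
    "place_of_birth": RelationTemplate("person", "location", "{head} was born in {tail}."),
}


def build_prompt(head, relation):
    """Verbalize an incomplete triple (head, relation, ?) with its type constraint."""
    t = TEMPLATES[relation]
    sentence = t.pattern.format(head=head, tail="[MISSING]")
    return f"{sentence} The missing entity must be of type '{t.tail_type}'."


def satisfies_type(candidate_type, relation):
    """Reject generated entities whose type violates the relation's constraint."""
    return candidate_type == TEMPLATES[relation].tail_type


print(build_prompt("Marie Curie", "place_of_birth"))
```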

MusKGC: A Flexible Multi-source Knowledge Enhancement Framework for Open-World Knowledge Graph Completion

Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) on downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must downsample its continuous state vectors into discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and the token-wise state transition trajectory from one agent to another. In particular, we find that, compared with the actual state values, the sequence of state changes in LLMs after generating each token better reflects the information hidden behind the inference process, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems. We have open-sourced all the code and data at https://anonymous.4open.science/r/StateDeltaEncoding/.
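The core quantity in State Delta Encoding, the change in hidden state after each generated token, is simple to compute once the sender's states are available. The sketch below assumes states are given as plain vectors with one extra leading state (before generation) and pairs each token with its delta. How the receiver consumes these deltas is not shown, and `encode_message` is a hypothetical name.

```python
def state_deltas(states):
    """Return h_t - h_{t-1} for consecutive hidden-state vectors."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(states, states[1:])]


def encode_message(tokens, states):
    """Pair each generated token with the state change it caused.

    Assumes states[0] is the sender's hidden state before the first generated
    token and states[t + 1] is the state right after generating tokens[t].
    """
    if len(states) != len(tokens) + 1:
        raise ValueError("expected one more state than tokens")
    return list(zip(tokens, state_deltas(states)))


states = [[0.0, 0.0], [1.0, 2.0], [1.0, 5.0]]
print(encode_message(["x", "y"], states))  # [('x', [1.0, 2.0]), ('y', [0.0, 3.0])]
```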

Augmenting Multi-Agent Communication with State Delta Trajectory

We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on two popular authorship attribution models and three evaluation datasets, in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state-of-the-art results. Our analysis gives further insights into how different layers of our model specialize in representing certain stylistic features that benefit the model when tested out of domain.
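One common way to exploit several transformer layers at once is an ELMo-style scalar mix: a softmax-normalized weighted sum of per-layer embeddings. It is swapped in here as an illustration only; the abstract does not specify how the paper combines layers. `layer_logits` would normally be learned, and the toy embeddings are hypothetical.

```python
import math


def scalar_mix(layer_embeddings, layer_logits, gamma=1.0):
    """gamma * sum_l softmax(layer_logits)[l] * layer_embeddings[l]."""
    m = max(layer_logits)
    exps = [math.exp(w - m) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_embeddings[0])
    return [gamma * sum(w * emb[i] for w, emb in zip(weights, layer_embeddings)) for i in range(dim)]


# Hypothetical 2-dimensional embeddings of one document from two layers.
layers = [[1.0, 0.0], [3.0, 2.0]]
print(scalar_mix(layers, [0.0, 0.0]))  # [2.0, 1.0]
```

Equal logits average the layers; pushing one logit far below the others effectively drops that layer, which lets training decide which layers' stylistic signals matter.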
