China

End-to-end automatic speech recognition systems often fail to transcribe domain-speciffcnamed entities, causing catastrophic failuresin downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performances. However, when theforms of the wrongly-transcribed words(s) and the ground-truth entity are signiffcantly different, these methods often fail to locate the wrongly transcribed words in hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we inovatively design a generative method to annotate entityerrors in ASR transcripts and replace the textwith correct entities. This method is effective inscenarios of word form difference. We test ourmethod using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring signiffcant improvement to entity accuracy. We will open source our self constructed test set and training data.

EMNLP 2025

Generative Annotation for ASR Named Entity Correction

speech entity error correction

speech recognition

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Temporal evolution attribute graph prediction, a key task in graph machine learning, aims to forecast the dynamic evolution of node attributes over time. While recent advances in Large Language Models (LLMs) have enabled their use in enhancing node representations for integration with Graph Neural Networks (GNNs), their potential to directly perform GNN-like aggregation and interaction remains underexplored. Furthermore, traditional approaches to initializing attribute embeddings often disregard structural semantics, limiting the provision of rich prior knowledge to GNNs. Current methods also primarily focus on 1-hop neighborhood aggregation, lacking the capability to capture complex structural interactions. To address these limitations, we propose a novel prediction framework that integrates structural information into attribute embeddings through the introduction of an attribute embedding loss. We design specialized prompts to enable LLMs to perform GNN-like aggregation and incorporate a relation-aware Graph Convolutional Network to effectively capture long-range and complex structural dependencies. Extensive experiments on multiple real-world datasets validate the effectiveness of our approach, demonstrating significant improvements in predictive performance over existing methods.

LGA: LLM-GNN Aggregation for Temporal Evolution Attribute Graph Prediction

Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational complexity, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.

Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

Autoregressive speech token generation models produce speech with remarkable variety and naturalness but often suffer from hallucinations and undesired vocalizations that do not conform to conditioning inputs. To address these challenges, we introduce Koel-TTS, an encoder-decoder transformer model for multilingual TTS that improves contextual adherence of speech generation LLMs through preference alignment and classifier-free guidance (CFG). For preference alignment, we design a reward system that ranks model outputs using automatic metrics derived from speech recognition and speaker verification models, encouraging generations that better match the input text and speaker identity. CFG further allows fine-grained control over the influence of conditioning inputs during inference by interpolating conditional and unconditional logits. Notably, applying CFG to a preference-aligned model yields additional gains in transcription accuracy and speaker similarity, demonstrating the complementary benefits of both techniques. Koel-TTS achieves state-of-the-art results in zero-shot TTS, outperforming prior LLM-based models on intelligibility, speaker similarity, and naturalness, despite being trained on significantly less data.

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

In recent years, Large Language Models (LLMs) have made significant progress in emotional support dialogue. However, there are two major challenges for LLM-based support systems. First, users may be hesitant to fully disclose their emotions at the outset. Second, direct probing or excessive questioning can induce discomfort or even resistance. To bridge this gap, we propose COCOON, a proactive emotional support framework that leverages principles of active listening to uncover implicit user needs. We design a multi-stage data curation pipeline and a self-learning mechanism for annotating support strategies. Based on this framework, we build COCOON-Llama3, a fine-tuned large language model, and evaluate it using both standard metrics and psychological scales. Experimental results indicate that our model more effectively elicits implicit emotional needs and delivers empathetic support compared to existing baselines, suggesting its utility for building more inclusive emotional support dialogue systems.

Look Beyond Feeling: Unveiling Latent Needs from Implicit Expressions for Proactive Emotional Support

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes.

FuseChat: Knowledge Fusion of Chat Models

End-to-end automatic speech recognition (ASR) based on deep learning has achieved impressive progress in recent years. However, the performance of ASR foundation model often degrades significantly on out-of-domain data due to real-world domain shifts. Test-Time Adaptation (TTA) methods aim to mitigate this issue by adapting models during inference without access to source data. Despite recent progress, existing ASR TTA methods often struggle with instability under continual and long-term distribution shifts. To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. DMSUTA maintains a dynamic model bank, from which a subset of checkpoints is selected for each test sample based on confidence and uncertainty criteria. To preserve both model plasticity and long-term stability, DMSUTA actively manages the bank by filtering out potentially collapsed models. This design allows DMSUTA to continually adapt to evolving domain shifts in ASR test-time scenarios. Experiments on diverse, continuously shifting ASR TTA benchmarks show that DMSUTA consistently outperforms existing continual TTA baselines, demonstrating superior robustness to domain shifts in ASR.

Dynamic Model-Bank Test-Time Adaptation for Automatic Speech Recognition

Goal-oriented dialogues, such as recommendation and negotiation, often require balancing multiple, conflicting objectives. Existing methods typically involve training separate models for specific combinations of objectives, leading to computational and scalability issues. In this work, we aim to develop a new dialogue policy method that can adapt to varying objective preferences at inference time without retraining. This raises several challenges in terms of both (1) optimization strategy and (2) knowledge utilization. To address these, we propose a novel learning framework, Preference Adaptive Dialogue Policy Planner (PADPP), for multi-objective goal-oriented dialogues. Specifically, to tackle the former, we introduce a novel policy optimization scheme, which leverages information gained from training the model on previously updated objective weights, accelerating the learning capability on new weight settings. To address the latter, we utilize Generalized Policy Improvement (GPI) to ensure the effectiveness of leveraged knowledge. Experimental results demonstrate that PADPP achieves superior adaptability and performance compared to state-of-the-art approaches, offering a scalable and flexible solution for multi-objective, goal-oriented dialogues. Code and data are available at the anonymous link.

One Planner To Guide Them All ! Learning Adaptive Conversational Planners for Goal-oriented Dialogues

Text embeddings and text embedding models are bread-and-butter in many NLP tasks, especially in those that involve classification, clustering or search. However, interpretability challenges persist, especially in explaining obtained similarity scores, which is crucial for applications requiring transparency. In this piece, we give a structured overview of methods specializing in explaining those similarity scores, an interesting and fairly unexplored research area. Specifically, we highlight the methods' individual ideas and techniques, as well as their trade-offs, assessing their potential for interpreting text embeddings and explaining similarity. We finally outline opportunities and open challenges for future research.

Interpretable Text Embeddings and Text Similarity Explanation: A Survey

Detecting offensive language in Chinese is challenging due to homophonic substitutions used to evade detection. We propose a framework to improve large language models’ robustness against such phonetic attacks. First, we construct HED-COLD, a homophone-enhanced dataset based on the Chinese Offensive Language Dataset. Additionally, we propose a homophone-aware pretraining strategy that aligns semantics and fuses features to learn robust mappings between original and perturbed text. Experimental results show that our approach achieves state-of-the-art performance on both the COLD test set and the toxicity benchmark ToxiCloakCN. Notably, it achieves greater gains in domains especially prone to homophonic attacks, such as gender and regional content. These results demonstrate improved robustness and generalization against phonetic adversarial attacks.

Enhancing Chinese Offensive Language Detection with Homophonic Perturbation

Disentangling how gender and occupations are encoded by LLMs is crucial to identify possible biases and prevent harms, especially given the widespread use of LLMs in sensitive domains such as human resources. In this work, we carry out an in-depth investigation of gender and occupational biases in English and Italian as expressed by 9 different LLMs (both base and instruction-tuned). Specifically, we focus on the analysis of sentence completions when LLMs are prompted with job-related sentences including different gender representations. We carry out a manual analysis of 4,500 generated texts over 4 dimensions that can reflect bias, we propose a novel embedding-based method to investigate biases in generated texts and, finally, we carry out a lexical analysis of the model completions. In our qualitative and quantitative evaluation we show that many facets of social bias remain unaccounted for even in aligned models, and LLMs in general still reflect existing gender biases in both languages. Finally, we find that models still struggle with gender-neutral expressions, especially beyond English.

Downloads

Next from EMNLP 2025

LGA: LLM-GNN Aggregation for Temporal Evolution Attribute Graph Prediction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES