China

Subjectivity in NLP tasks, _e.g._, toxicity classification, has emerged as a critical challenge precipitated by the increased deployment of NLP systems in content-sensitive domains. Conventional approaches aggregate annotator judgements (labels), ignoring minority perspectives, and overlooking the influence of the sociocultural context behind such annotations. We propose a framework where subjectivity in binary labels is modeled as an empirical distribution accounting for the variation in annotators through human values extracted from sociocultural descriptors using a language model. The framework also allows for downstream tasks such as population and sociocultural group-level majority label prediction. Experiments on three toxicity datasets covering human-chatbot conversations and social media posts annotated with diverse annotator pools demonstrate that our approach yields well-calibrated toxicity distribution predictions across binary toxicity labels, which are further used for majority label prediction across cultural subgroups, improving over existing methods.

EMNLP 2025

Learning Subjective Label Distributions via Sociocultural Descriptors

value-centered evaluation

annotator subjectivity

values and culture

human factors in nlp

human-centered design

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Generative Large Language Models (LLMs) infer user's demographic information from subtle cues in the conversation --- a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.

Reading Between the Prompts: How Stereotypes Shape LLM’s Implicit Personalization

Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities.Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.

Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities

End-to-end automatic speech recognition systems often fail to transcribe domain-speciffcnamed entities, causing catastrophic failuresin downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performances. However, when theforms of the wrongly-transcribed words(s) and the ground-truth entity are signiffcantly different, these methods often fail to locate the wrongly transcribed words in hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we inovatively design a generative method to annotate entityerrors in ASR transcripts and replace the textwith correct entities. This method is effective inscenarios of word form difference. We test ourmethod using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring signiffcant improvement to entity accuracy. We will open source our self constructed test set and training data.

Generative Annotation for ASR Named Entity Correction

Temporal evolution attribute graph prediction, a key task in graph machine learning, aims to forecast the dynamic evolution of node attributes over time. While recent advances in Large Language Models (LLMs) have enabled their use in enhancing node representations for integration with Graph Neural Networks (GNNs), their potential to directly perform GNN-like aggregation and interaction remains underexplored. Furthermore, traditional approaches to initializing attribute embeddings often disregard structural semantics, limiting the provision of rich prior knowledge to GNNs. Current methods also primarily focus on 1-hop neighborhood aggregation, lacking the capability to capture complex structural interactions. To address these limitations, we propose a novel prediction framework that integrates structural information into attribute embeddings through the introduction of an attribute embedding loss. We design specialized prompts to enable LLMs to perform GNN-like aggregation and incorporate a relation-aware Graph Convolutional Network to effectively capture long-range and complex structural dependencies. Extensive experiments on multiple real-world datasets validate the effectiveness of our approach, demonstrating significant improvements in predictive performance over existing methods.

LGA: LLM-GNN Aggregation for Temporal Evolution Attribute Graph Prediction

Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational complexity, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.

Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

Autoregressive speech token generation models produce speech with remarkable variety and naturalness but often suffer from hallucinations and undesired vocalizations that do not conform to conditioning inputs. To address these challenges, we introduce Koel-TTS, an encoder-decoder transformer model for multilingual TTS that improves contextual adherence of speech generation LLMs through preference alignment and classifier-free guidance (CFG). For preference alignment, we design a reward system that ranks model outputs using automatic metrics derived from speech recognition and speaker verification models, encouraging generations that better match the input text and speaker identity. CFG further allows fine-grained control over the influence of conditioning inputs during inference by interpolating conditional and unconditional logits. Notably, applying CFG to a preference-aligned model yields additional gains in transcription accuracy and speaker similarity, demonstrating the complementary benefits of both techniques. Koel-TTS achieves state-of-the-art results in zero-shot TTS, outperforming prior LLM-based models on intelligibility, speaker similarity, and naturalness, despite being trained on significantly less data.

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

In recent years, Large Language Models (LLMs) have made significant progress in emotional support dialogue. However, there are two major challenges for LLM-based support systems. First, users may be hesitant to fully disclose their emotions at the outset. Second, direct probing or excessive questioning can induce discomfort or even resistance. To bridge this gap, we propose COCOON, a proactive emotional support framework that leverages principles of active listening to uncover implicit user needs. We design a multi-stage data curation pipeline and a self-learning mechanism for annotating support strategies. Based on this framework, we build COCOON-Llama3, a fine-tuned large language model, and evaluate it using both standard metrics and psychological scales. Experimental results indicate that our model more effectively elicits implicit emotional needs and delivers empathetic support compared to existing baselines, suggesting its utility for building more inclusive emotional support dialogue systems.

Look Beyond Feeling: Unveiling Latent Needs from Implicit Expressions for Proactive Emotional Support

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes.

FuseChat: Knowledge Fusion of Chat Models

End-to-end automatic speech recognition (ASR) based on deep learning has achieved impressive progress in recent years. However, the performance of ASR foundation model often degrades significantly on out-of-domain data due to real-world domain shifts. Test-Time Adaptation (TTA) methods aim to mitigate this issue by adapting models during inference without access to source data. Despite recent progress, existing ASR TTA methods often struggle with instability under continual and long-term distribution shifts. To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. DMSUTA maintains a dynamic model bank, from which a subset of checkpoints is selected for each test sample based on confidence and uncertainty criteria. To preserve both model plasticity and long-term stability, DMSUTA actively manages the bank by filtering out potentially collapsed models. This design allows DMSUTA to continually adapt to evolving domain shifts in ASR test-time scenarios. Experiments on diverse, continuously shifting ASR TTA benchmarks show that DMSUTA consistently outperforms existing continual TTA baselines, demonstrating superior robustness to domain shifts in ASR.

Dynamic Model-Bank Test-Time Adaptation for Automatic Speech Recognition

Goal-oriented dialogues, such as recommendation and negotiation, often require balancing multiple, conflicting objectives. Existing methods typically involve training separate models for specific combinations of objectives, leading to computational and scalability issues. In this work, we aim to develop a new dialogue policy method that can adapt to varying objective preferences at inference time without retraining. This raises several challenges in terms of both (1) optimization strategy and (2) knowledge utilization. To address these, we propose a novel learning framework, Preference Adaptive Dialogue Policy Planner (PADPP), for multi-objective goal-oriented dialogues. Specifically, to tackle the former, we introduce a novel policy optimization scheme, which leverages information gained from training the model on previously updated objective weights, accelerating the learning capability on new weight settings. To address the latter, we utilize Generalized Policy Improvement (GPI) to ensure the effectiveness of leveraged knowledge. Experimental results demonstrate that PADPP achieves superior adaptability and performance compared to state-of-the-art approaches, offering a scalable and flexible solution for multi-objective, goal-oriented dialogues. Code and data are available at the anonymous link.

Downloads

Next from EMNLP 2025

Reading Between the Prompts: How Stereotypes Shape LLM’s Implicit Personalization

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES