Recent spoken dialogue systems employ large language models (LLMs) with advanced reasoning capabilities as their core architecture. However, text optimized for reading differs from delivery optimized for listening, which makes it difficult to leverage the reasoning process effectively in spoken communication. Although some efforts adapt language models toward more speech-suitable delivery, the impact of these modifications on the models' reasoning capabilities remains underexplored. In this work, we propose the Think-Verbalize-Speak framework, which separates the reasoning process from the spoken content to fully harness the reasoning capabilities of LLMs in spoken dialogue. Specifically, we introduce an intermediate step between thinking and speaking, termed "verbalizing", in which the thought process is translated into comprehensible text. We also present ReVerT, a latency-efficient implementation of the verbalizer based on incremental and asynchronous summarization. Extensive automatic and human evaluations across multiple benchmarks demonstrate that our approach improves speech naturalness and conciseness with minimal compromise to reasoning ability. We release both the dataset and its construction pipeline to facilitate future research.
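The abstract's think → verbalize → speak separation can be pictured as a minimal pipeline sketch. This is purely illustrative: the function names, the stubbed reasoning, and the naive "keep the conclusion" verbalizer are assumptions for exposition, not the authors' API or the actual ReVerT summarizer (which works incrementally and asynchronously over the reasoning stream).

```python
# Hypothetical sketch of the Think-Verbalize-Speak separation.
# All names and bodies are illustrative stand-ins, not the paper's code.

def think(question: str) -> str:
    """Stage 1: produce a detailed, reader-oriented chain of thought (stubbed)."""
    return f"Step 1: parse '{question}'. Step 2: compute. Step 3: conclude 42"

def verbalize(thought: str) -> str:
    """Stage 2: translate the thought into concise, listener-friendly text.

    ReVerT would summarize incrementally and asynchronously as thoughts
    stream in; this stand-in simply keeps the final conclusion.
    """
    return thought.split(". ")[-1]

def speak(utterance: str) -> str:
    """Stage 3: hand the verbalized text to a TTS engine (stubbed as identity)."""
    return utterance

# The spoken output carries the answer without the verbose reasoning trace.
answer = speak(verbalize(think("what is 6 x 7?")))
print(answer)
```

The point of the sketch is the interface boundary: the verbose reasoning never reaches the speech stage directly, only its verbalized summary does, which is what lets the system keep full reasoning while speaking concisely.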