China

During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus for reproducibility.

EMNLP 2025

LVLMs are Bad at Overhearing Human Referential Communication

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Despite biases in Large Language Models (LLMs) being widely researched, systematic explorations of political biases in news article generation tasks remain underexplored. This study evaluates political bias across seven LLMs by leveraging our PublicViews dataset-extracted from the TwinViews-13K corpus-comprising 31 topics and 31,692 statements. We analyze 10,850 articles, finding left-leaning bias persists in generation tasks, with neutral content remaining rare even under balanced opinion settings. Models exhibit asymmetric behavior in minority opinion scenarios, amplifying preferred viewpoints when in minority while conforming to majority opinions otherwise. Notably, all models employ `stance-flipping quotations'' (altering supporters' statements to express opposite viewpoints) in 33-38\% of quotations despite explicit instructions against distortion. Consistent with prior research, increased model size failed to enhance neutrality. This research measures political bias in LLM-generated news, analyzes its mechanisms, and reveals how opinion distribution and explicitness affect bias expression. Our results highlight how LLMs can introduce unintended political bias in generative contexts. We publicly release our PublicViews corpus and code at https://anonymous.4open.science/r/Fair-or-Framed-46F1.

Fair or Framed? Political Bias in News Articles Generated by LLMs

Unlike traditional search engines that present ranked lists of webpages, generative search engines rely solely on in-line citations as the key gateway to real-world information exposure, making it crucial to examine whether LLM-based citation generation introduces bias—particularly for politically sensitive queries. In this work, we introduce AllSides-2024, a newly constructed dataset comprising the latest real-world news articles labeled with left- or right-leaning stances for investigating potential political bias in LLM-generated citations. Through systematic evaluations, we find that LLMs exhibit a consistent tendency to cite left-leaning sources at notably higher rates compared to traditional retrieval systems. Controlled experiments further reveal that this bias arises from a preference for media outlets identified as left-leaning, rather than for left-oriented content itself. Meanwhile, our findings show that while LLMs struggle to infer political bias from news content alone, they can almost perfectly recognize the political orientation of media outlets based on their names. These insights highlight the risk that, in the era of generative search engines, information exposure may be disproportionately shaped by specific media outlets, further shaping public perception and decision-making.

Media Source Matters More Than Content: Unveiling Political Bias in LLM-Generated Citations

Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.

Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset

We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle—without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG—such as enforcing question diversity—enable weaker models to match GPT-4o. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.

GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.

Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

Spoken language from older adults often deviates from written norms due to omission, disordered syntax, constituent errors, and redundancy, limiting the usefulness of automatic transcripts in downstream tasks. We present COAS2W, a Chinese spoken-to-written corpus of 10, 004 utterances from older adults, each paired with a written version, fine-grained error labels, and four-sentence context. Unlike existing resources, COAS2W captures cross-sentence dependencies crucial for resolving ambiguities and recovering missing content. Fine-tuned lightweight open-source models on COAS2W outperform larger closed-source models. Context ablation shows the value of multi-sentence input, and normalization improves performance on downstream translation tasks. COAS2W supports the development of inclusive, context-aware language technologies for older speakers.

COAS2W: A Chinese Older-Adults Spoken-to-Written Transformation Corpus with Context Awareness

Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and data limitations. Previous work aligns LLMs with different cultures using survey data, primarily from the World Values Survey (WVS). However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for tasks like offensiveness classification. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To address these issues, we propose augmenting WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. Our experiments across multiple cultures show that this approach captures more enhances differentiated cultural values and improves downstream classification performances.

From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce STEER-BENCH, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, STEER-BENCH includes over 10,000 instruction-response pairs and validated 5,500 multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using STEER-BENCH reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. STEER-BENCH is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.

STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

Introducing **MARK**, the **M**ulti-st**A**ge **R**easoning framewor**K** for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.

Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning

The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM's pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.

Downloads

Next from EMNLP 2025

Fair or Framed? Political Bias in News Articles Generated by LLMs

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES