EMNLP 2025

November 06, 2025

Suzhou, China

Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, hallucinations inherited from the underlying LLM, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40 times higher than base prompts. Evaluating 11 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies, highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.
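To make the scene-task inconsistency setup concrete, the sketch below is a minimal illustrative example in Python, not the paper's probing set or benchmark code: it builds a prompt whose task refers to an object absent from the observed scene, then counts the planner's output as hallucinated if it references that object instead of flagging the task as infeasible. The object names, prompt wording, and helper functions are hypothetical placeholders.

# Illustrative scene-task inconsistency probe (hypothetical; not the paper's code).
# Idea: ask a planner to act on an object missing from the scene, then check
# whether the plan hallucinates that object or correctly flags infeasibility.

def build_probe(scene_objects, missing_object):
    """Build a prompt whose task refers to an object not present in the scene."""
    assert missing_object not in scene_objects
    scene = ", ".join(scene_objects)
    return (
        f"You are a household robot. Observed objects: {scene}.\n"
        f"Task: put the apple inside the {missing_object}.\n"
        "List your plan as numbered steps, or reply 'INFEASIBLE' if the task cannot be completed."
    )

def is_hallucinated(plan_text, missing_object):
    """A plan that uses the absent object without flagging infeasibility counts as hallucinated."""
    flags_infeasible = "INFEASIBLE" in plan_text.upper()
    mentions_missing = missing_object.lower() in plan_text.lower()
    return mentions_missing and not flags_infeasible

if __name__ == "__main__":
    prompt = build_probe(["apple", "table", "sofa", "sink"], missing_object="refrigerator")
    # Stub model output; in practice this would come from the LLM planner under test.
    plan = "1. Walk to the refrigerator. 2. Open the door. 3. Place the apple inside."
    print("hallucinated:", is_hallucinated(plan, "refrigerator"))  # -> True

The actual probing set described in the abstract is built on an existing benchmark and evaluated across two simulation environments; this sketch only illustrates the object-absence case (the nonexistent refrigerator) mentioned there.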
