Multimodal Large Language Models (MLLMs) show promise for enabling embodied agents to operate meaningfully in complex, human-centered environments. Yet evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. We therefore introduce HRDBench, a cognitively grounded benchmark for evaluating Human-centered Embodied Reasoning and Decision-making in MLLMs. HRDBench consists of 1,113 real-world situations paired with 6,126 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a holistic framework for assessing a model’s ability to perceive, reason, and act in socially meaningful ways. We evaluate state-of-the-art commercial and open-source models on HRDBench, revealing distinct performance patterns and highlighting significant challenges. Our in-depth analysis further offers insights into current model limitations and supports the development of MLLMs with more robust, context-aware, and socially adept embodied decision-making capabilities for real-world scenarios.
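To make the evaluation protocol concrete, here is a minimal sketch of how per-dimension multiple-choice accuracy might be computed on a benchmark of this shape. The item fields, file paths, and the `query_model` stub are illustrative assumptions for exposition, not the authors' released data format or API.

```python
from collections import defaultdict

# Hypothetical item format: each entry pairs a real-world situation
# (image plus question) with a multiple-choice question tagged by one
# of the three ability dimensions named in the abstract.
items = [
    {
        "image_path": "situations/0001.jpg",  # illustrative path
        "question": "What should the agent do next?",
        "choices": ["A. Wait", "B. Offer help", "C. Leave", "D. Ask again"],
        "answer": "B",
        "dimension": "Context-Driven Action Justification",
    },
    # ... the actual benchmark has 6,126 questions over 1,113 situations
]

def query_model(image_path: str, question: str, choices: list[str]) -> str:
    """Stand-in for an MLLM call; replace with a real client that
    returns the model's chosen option letter ('A'-'D')."""
    return "A"

# Accumulate correct/total counts per reasoning dimension.
correct: dict[str, int] = defaultdict(int)
total: dict[str, int] = defaultdict(int)
for item in items:
    pred = query_model(item["image_path"], item["question"], item["choices"])
    total[item["dimension"]] += 1
    if pred.strip().upper().startswith(item["answer"]):
        correct[item["dimension"]] += 1

for dim in total:
    print(f"{dim}: {correct[dim] / total[dim]:.1%} ({correct[dim]}/{total[dim]})")
```

Reporting accuracy separately per dimension, rather than a single aggregate score, is what lets this kind of benchmark surface the distinct performance patterns the abstract describes.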