China

The rapid advancement of large language models (LLMs) significantly enhances long-context Retrieval-Augmented Generation (RAG), yet existing benchmarks focus primarily on English. This leaves low-resource languages without comprehensive evaluation frameworks, limiting their progress in retrieval-based tasks. To bridge this gap, we introduce Ko-LongRAG, the first Korean long-context RAG benchmark. Unlike conventional benchmarks that depend on external retrievers, Ko-LongRAG adopts a retrieval-free approach designed around Specialized Content Knowledge (SCK), enabling controlled and high-quality QA pair generation without the need for an extensive retrieval infrastructure. By clustering domain-specific documents and generating intra-cluster question-answer pairs, Ko-LongRAG effectively simulates retrieval-based reasoning while maintaining high contextual fidelity. Our evaluation shows that the o1 model achieves the highest performance among proprietary models, while EXAONE 3.5 leads among open-source models. Additionally, various findings confirm Ko-LongRAG as a reliable benchmark for assessing Korean long-context RAG capabilities and highlight its potential for advancing multilingual RAG research. The dataset and source code will be released publicly.

EMNLP 2025

Ko-LongRAG: A Korean Long-Context RAG Benchmark Built with a Retrieval-Free Approach

korean long-context rag benchmarks

retrieval-free

low-resource language

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

One challenge of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% improvement in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and a 2.9% reduction in perplexity, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it more accessible.
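To make the idea of kurtosis as a measure of tailedness concrete, the following is a minimal sketch rather than the KurTail method itself. It assumes a synthetic activation matrix with one outlier channel and uses a random orthogonal rotation in place of a learned one; the helper names `kurtosis` and `random_rotation` are illustrative only.

```python
import numpy as np

def kurtosis(x):
    """Fourth standardized moment over all entries (about 3 for a Gaussian)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2)

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix.

    Assumption: this stands in for a learned rotation purely for illustration.
    """
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)

# Synthetic "activations": 4096 tokens, 64 channels, one heavy outlier channel.
acts = rng.standard_normal((4096, 64))
acts[:, 0] *= 50.0

rot = random_rotation(64, rng)
rotated = acts @ rot  # the rotation spreads the outlier energy across channels

k_before = kurtosis(acts)
k_after = kurtosis(rotated)
print(f"kurtosis before rotation: {k_before:.1f}")
print(f"kurtosis after rotation:  {k_after:.1f}")
```

Because the rotation is orthogonal, it preserves the overall norm of the activations while flattening the heavy tail, which is the property a kurtosis objective rewards.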

KurTail : Kurtosis-based LLM Quantization

Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. We hence introduce HRDBench, a cognitively grounded benchmark for evaluating Human-centered Embodied Reasoning and Decision-making in MLLMs. HRDBench consists of 1,113 real-world situations paired with 6,126 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a holistic framework for assessing a model’s ability to perceive, reason, and act in socially meaningful ways. We evaluate state-of-the-art commercial and open-source models on the benchmark, where we reveal distinct performance patterns and highlight significant challenges. Our in-depth analysis further offers insights into current model limitations and supports the development of MLLMs with more robust, context-aware, and socially adept embodied decision-making capabilities for real-world scenarios.

VIVA+: Human-Centered Situational Decision-Making

Verbs occur in a particular syntactic environment (frame) along with their arguments. In this paper we introduce a new Hindi verb alternations benchmark to investigate whether pretrained large language models (LLMs) can infer the frame-selectional properties of Hindi verbs. Our benchmark consists of minimal pairs such as 'Tina cut the wood' / 'Tina disappeared the wood' that are annotated with human judgments. We expect that LLMs will assign lower probability to the unacceptable sentence. We create four variants of these alternations for Hindi to test knowledge of verb morphology and argument case-marking. Our results show that a masked monolingual model performs the best, while causal models fare poorly. We further test the quality of the predictions using a cloze-style sentence completion task. While the models appear to infer the right mapping between verbal morphology and valency in the acceptability task, they do not generate the right verbal morphology in the cloze task. The model completions also lack pragmatic and world knowledge. LLMs need to make both syntactic and semantic generalizations about verbal alternations, unlike other syntactic phenomena (like agreement). Our work points towards the need for greater cross-linguistic investigation of verbal alternations.
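The acceptability evaluation described above can be sketched as follows. This is an illustrative outline, not the authors' evaluation code: the function `minimal_pair_accuracy` is a hypothetical helper, the log-probabilities are made-up numbers standing in for model scores, and the second sentence pair is invented for the example (only the first pair appears in the abstract).

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the acceptable sentence scores strictly higher.

    pairs: list of (acceptable, unacceptable) sentence tuples.
    score: callable mapping a sentence to a model score such as a log-probability.
    """
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# Made-up sentence log-probabilities; a real run would query a language model.
toy_log_probs = {
    "Tina cut the wood": -21.4,
    "Tina disappeared the wood": -27.9,
    "Ravi broke the glass": -23.0,  # hypothetical pair, not from the benchmark
    "Ravi fell the glass": -22.1,
}

pairs = [
    ("Tina cut the wood", "Tina disappeared the wood"),
    ("Ravi broke the glass", "Ravi fell the glass"),
]

accuracy = minimal_pair_accuracy(pairs, toy_log_probs.__getitem__)
print(accuracy)  # 0.5: the toy scores prefer the wrong sentence in the second pair
```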

A Benchmark for Hindi Verb-Argument Structure Alternations

This work fosters research on the interaction between natural language use and gambling disorders. We have built a new Spanish corpus for screening standardized gambling symptoms. We employ search methods to find on-topic sentences, top-k pooling to form the assessment pools of sentences, and thorough annotation guidelines. The labeling task is challenging, given the need to identify topic relevance and explicit evidence about the symptoms. Additionally, we explore the use of state-of-the-art LLMs for annotation and compare different sentence search models.
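As a rough illustration of top-k pooling, the sketch below forms an assessment pool as the union of the top-k results from several search methods. The system names, sentence identifiers, and the helper `build_pool` are hypothetical; the corpus's actual search models and pool depth are not given here.

```python
def build_pool(rankings, k):
    """Union of the top-k items from each ranking, kept in first-seen order."""
    pool = []
    seen = set()
    for ranking in rankings.values():
        for sentence_id in ranking[:k]:
            if sentence_id not in seen:
                seen.add(sentence_id)
                pool.append(sentence_id)
    return pool

# Hypothetical ranked sentence IDs returned by two search methods for one symptom.
rankings = {
    "lexical": ["s3", "s1", "s7", "s2"],
    "dense": ["s1", "s5", "s3", "s9"],
}

pool = build_pool(rankings, k=2)
print(pool)  # ['s3', 's1', 's5']
```

Only the pooled sentences are sent to annotators, so the labeling effort scales with k and the number of search methods rather than with the full corpus.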

Analyzing Gambling Addictions: A Spanish Corpus for Understanding Pathological Behavior

Using special tokens (e.g., gist, memory, or compressed tokens) to compress context information is a common practice for large language models (LLMs). However, existing approaches often neglect that position encodings inherently induce local inductive biases in models, causing the compression process to ignore holistic contextual dependencies. We propose **Enhanced Position Layout (EPL)**, a simple yet effective method that improves the context compression capability of LLMs by only adjusting position IDs, the numerical identifiers that specify token positions. EPL minimizes the distance between context tokens and their corresponding special tokens while maintaining the sequence order in position IDs between context tokens, special tokens, and the subsequent tokens. Integrating EPL into our best-performing context compression model yields an average improvement of 1.9 ROUGE-1 F1 on out-of-domain question answering datasets. When extended to multimodal scenarios, EPL brings an average accuracy gain of 2.6 to vision compression LLMs.
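The effect of moving special-token position IDs closer to the context they summarize can be illustrated with a small calculation. This is a sketch under stated assumptions, not the EPL layout itself: it assumes the context is split into equal chunks with one special token per chunk, and the "uniform" layout shown (each special token placed at the centre of its chunk) is just one possible choice. The function names are hypothetical, and position IDs for subsequent tokens are not modelled.

```python
def default_layout(n_context, n_special):
    """Special tokens take the position IDs that directly follow the context."""
    context_ids = list(range(n_context))
    special_ids = list(range(n_context, n_context + n_special))
    return context_ids, special_ids

def uniform_layout(n_context, n_special):
    """Assumed alternative: each special token sits at the centre of its chunk.

    Note that a special token then shares a position ID with a context token.
    """
    chunk = n_context // n_special  # assumes n_context is divisible by n_special
    context_ids = list(range(n_context))
    special_ids = [i * chunk + chunk // 2 for i in range(n_special)]
    return context_ids, special_ids

def mean_distance(context_ids, special_ids):
    """Average |position gap| between each context token and its chunk's special token."""
    chunk = len(context_ids) // len(special_ids)
    total = sum(
        abs(pos - special_ids[i // chunk]) for i, pos in enumerate(context_ids)
    )
    return total / len(context_ids)

default_gap = mean_distance(*default_layout(12, 3))
uniform_gap = mean_distance(*uniform_layout(12, 3))
print(default_gap)  # 7.5
print(uniform_gap)  # 1.0
```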

Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models

Multi-hop question answering is a challenging task that requires capturing information from different positions in multiple documents. Recently, several methods have proposed enhancing Large Language Models (LLMs) by incorporating structured knowledge, aiming to grasp key information for solving this task. Despite certain achievements, they still face the following challenges: 1) The neglect of text-based reasoning capabilities. 2) Information redundancy between text and triples. 3) Information loss during structured knowledge extraction. To solve the above challenges, in this paper, we propose Dynamic Combination of Structured Knowledge (DCSK), a novel framework for integrating text-based and triple-based paradigms. Following Occam's Razor, DCSK dynamically determines the necessity of structured knowledge through a multi-faceted evaluation, which systematically assesses the correctness, clarity, and informativeness of the text-based prediction. For questions that require structured knowledge, we develop an iterative fact refiner that screens for question-relevant triples, verifies their factual adequacy, and thereby effectively excludes irrelevant and redundant information. Furthermore, based on the verification results, we construct an adaptive knowledge reasoner that dynamically adjusts the need for text supplementation, thus mitigating the information deficiency in selected triples. Extensive experiments on three MHQA datasets demonstrate the efficiency and effectiveness of DCSK.

Following Occam’s Razor: Dynamic Combination of Structured Knowledge for Multi-Hop Question Answering using LLMs

The combination of pre-trained speech encoders with large language models has enabled the development of speech LLMs that can handle a wide range of spoken language processing tasks. While these models are powerful and flexible, this very flexibility may make them more vulnerable to adversarial attacks. To examine the extent of this problem, in this work we investigate universal acoustic adversarial attacks on speech LLMs. Here a fixed, universal, adversarial audio segment is prepended to the original input audio. We initially investigate attacks that cause the model to either produce no output or to perform a modified task overriding the original prompt. We then extend the nature of the attack to be selective so that it activates only when specific input attributes, such as speaker gender or spoken language, are present. Inputs without the targeted attribute should be unaffected, allowing fine-grained control over the model outputs. Our findings reveal critical vulnerabilities in Qwen2-Audio and Granite-Speech and suggest that similar speech LLMs may be susceptible to universal adversarial attacks. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.

Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs

The increasing size and complexity of large language models (LLMs) raise concerns about their ability to “cheat” on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While existing methods detect such leakage, they fail to address the long-term challenge of mitigating it. In this paper, we introduce LastingBench, a novel approach to reinforce and safeguard existing benchmarks against knowledge leakage. Our method involves identifying leakage points through perturbation-based detection, followed by counterfactual rewriting to disrupt memorization while preserving the benchmark's original evaluative intent. We demonstrate that our approach significantly reduces memorization effects in long-context QA benchmarks, providing a more accurate assessment of model reasoning and generalization abilities. Our experiments show that LastingBench not only uncovers substantial leakage in benchmarks like HotpotQA but also yields a more reliable evaluation of state-of-the-art models, ensuring that benchmarks remain effective and resilient over time.

LastingBench: Defend Benchmarks Against Knowledge Leakage

Machine translation (MT) research addressing gender inclusivity has gained attention for promoting non-exclusionary language representing all genders. However, existing resources are limited to short sources, most often single sentences, or single gender-fair formulation types, leaving questions about MT models' ability to use context and diverse inclusive forms. We introduce Glitter, a new English-German benchmark featuring extended passages with professional translations implementing three gender-fair alternatives: neutral rephrasing, typographical solutions (gender star), and neologistic forms (-ens endings). Our experiments reveal significant limitations in state-of-the-art language models, which default to masculine generics, struggle to interpret explicit gender cues in context, and rarely produce gender-fair translations. Through systematic prompting analysis designed to elicit fair language, we demonstrate that current models lack a fundamental understanding of source gender phenomena, failing to implement inclusive forms even when explicitly instructed. Glitter establishes a challenging benchmark, advancing research in gender-fair English-German MT. It highlights substantial room for improvement even among leading models and can serve to guide development of future MT models capable of accurately representing gender diversity.

Glitter: A Multi-Sentence, Multi-Reference Benchmark for Gender-Fair German Machine Translation

Temporal knowledge graph (TKG) reasoning, a central task in temporal knowledge representation, focuses on predicting future facts by leveraging historical temporal contexts. However, current approaches face two major challenges: limited generalization to unseen facts and insufficient interpretability of reasoning processes. To address these challenges, this paper proposes the **D**enoising **L**ogic-based **T**emporal **K**nowledge **G**raph (**DLTKG**) framework, which employs a denoising diffusion process to complete reasoning tasks by introducing a noise source and a historical condition-guiding mechanism. Specifically, DLTKG constructs fuzzy entity representations by treating historical facts as noise sources, thereby enhancing the semantic associations between entities and the generalization ability for unseen facts. Additionally, a condition-based guidance mechanism, rooted in the relationship evolutionary paths, is designed to improve the interpretability of the reasoning process. Furthermore, we introduce a fine-tuning strategy that optimizes the denoising process by leveraging shortest path information between the head entity and candidate entities. Experimental results on three benchmark datasets demonstrate that DLTKG outperforms state-of-the-art methods across multiple evaluation metrics. Our code is available at: https://anonymous.4open.science/r/DLTKG-7CCB
