Machine translation (MT) research addressing gender inclusivity has gained attention for promoting non-exclusionary language representing all genders. However, existing resources are limited to short sources, most often single sentences, or single gender-fair formulation types, leaving questions about MT models' ability to use context and diverse inclusive forms. We introduce Glitter, a new English-German benchmark featuring extended passages with professional translations implementing three gender-fair alternatives: neutral rephrasing, typographical solutions (gender star), and neologistic forms (-ens endings). Our experiments reveal significant limitations in state-of-the-art language models, which default to masculine generics, struggle to interpret explicit gender cues in context, and rarely produce gender-fair translations. Through systematic prompting analysis designed to elicit fair language, we demonstrate that current models lack a fundamental understanding of source gender phenomena, failing to implement inclusive forms even when explicitly instructed. Glitter establishes a challenging benchmark, advancing research in gender-fair English-German MT. It highlights substantial room for improvement even among leading models and can serve to guide development of future MT models capable of accurately representing gender diversity.

EMNLP 2025

Glitter: A Multi-Sentence, Multi-Reference Benchmark for Gender-Fair German Machine Translation

gender-fair language

German

machine translation

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Temporal knowledge graph (TKG) reasoning, a central task in temporal knowledge representation, focuses on predicting future facts by leveraging historical temporal contexts. However, current approaches face two major challenges: limited generalization to unseen facts and insufficient interpretability of reasoning processes. To address these challenges, this paper proposes the **D**enoising **L**ogic-based **T**emporal **K**nowledge **G**raph (**DLTKG**) framework, which employs a denoising diffusion process to complete reasoning tasks by introducing a noise source and a historical condition-guiding mechanism. Specifically, DLTKG constructs fuzzy entity representations by treating historical facts as noise sources, thereby enhancing the semantic associations between entities and the generalization ability for unseen facts. Additionally, a condition-based guidance mechanism, rooted in the relationship evolutionary paths, is designed to improve the interpretability of the reasoning process. Furthermore, we introduce a fine-tuning strategy that optimizes the denoising process by leveraging shortest path information between the head entity and candidate entities. Experimental results on three benchmark datasets demonstrate that DLTKG outperforms state-of-the-art methods across multiple evaluation metrics. Our code is available at: https://anonymous.4open.science/r/DLTKG-7CCB

DLTKG: Denoising Logic-based Temporal Knowledge Graph Reasoning

We introduce MANTA, an automated pipeline that generates high-quality large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach enables highly effective query-response generation with minimal human intervention. Extensive experiments on 8B-scale LLMs demonstrate that fine-tuning on the MANTA-1M dataset significantly outperforms other massive dataset generation methodologies, particularly in knowledge-intensive tasks such as MMLU and MMLU-Pro, while also delivering superior performance across a broad spectrum of tasks. Moreover, MANTA supports seamless scalability by allowing the continuous integration of web corpus data, enabling expansion into domains requiring intensive knowledge.

MANTA: A Scalable Pipeline for Transmuting Massive Web Corpora into Instruction Datasets

Knowledge-intensive queries require accurate answers that are explicitly grounded in retrieved evidence. However, existing retrieval-augmented generation (RAG) approaches often struggle with query complexity, suffer from propagated reasoning errors, or rely on incomplete or noisy retrieval, limiting their effectiveness. To address these limitations, we introduce UniRAG, a unified RAG framework that integrates entity-grounded query decomposition, break-down reasoning, and iterative query rewriting. Specifically, UniRAG decomposes queries into semantically coherent sub-queries, explicitly verifies retrieved sub-facts through a dedicated reasoning module, and adaptively refines queries based on identified knowledge gaps, significantly improving answer completeness and reliability. Extensive benchmark evaluations on complex question-answering datasets, including multi-hop HotPotQA and 2WikiMultihopQA, biomedical MedMCQA and MedQA, and fact-verification FEVER and SciFact, demonstrate that UniRAG consistently achieves performance improvements across various state-of-the-art LLMs, such as LLaMA-3.1-8B, GPT-3.5-Turbo, and Gemini-1.5-Flash.

UniRAG: A Unified RAG Framework for Knowledge-Intensive Queries with Decomposition, Break-Down Reasoning, and Iterative Rewriting

Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299

AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Leveraging the autonomous decision-making capabilities of large language models (LLMs) has yielded superior performance in reasoning tasks. However, despite the success of iterative or agentic retrieval-augmented generation (RAG) techniques, these methods are often constrained to a single solution space when confronted with complex problems. In this paper, we propose a novel thinking pattern in RAG that integrates autonomous strategic planning with efficient reasoning actions, significantly activating intrinsic reasoning capabilities and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS), which we refer to as AirRAG. Specifically, our approach designs five fundamental reasoning actions, which are expanded to a broad tree-based reasoning space using MCTS. The approach also incorporates self-consistency verification to explore potential reasoning paths and the inference scaling law. Additionally, computationally optimal strategies are employed to allocate more inference resources to key actions, thereby enhancing overall performance. Experimental results demonstrate the effectiveness of AirRAG, showing significant performance gains on complex question-answering datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies and models.

AirRAG: Autonomous Strategic Planning and Reasoning Steer Retrieval Augmented Generation

The prevailing "trivia-centered paradigm" for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly "neutral" evaluation settings. In this position paper, we argue for **intentionally cultural evaluation**: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don't know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.

Culture is Everywhere: A Call for Intentionally Cultural Evaluation

Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference. All code and data will be released on GitHub.

ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation

Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Binary relations, such as equality, are basic mathematical concepts that appear, implicitly or explicitly, in most benchmarks for Large Language Models (LLMs). A recent trend in the literature is benchmarking LLMs on out-of-context learning, where the data is not presented in the prompt, but only during the model's training. However, existing works mostly focus on higher-order tasks, making it hard to interpret success or failure. In this work, we study how well LLMs can reason out-of-context on binary relations by only learning the representations of newly introduced tokens. Our experiments focus on equality (=), inequality (<), and inclusion (subset) and the properties they satisfy, such as reflexivity, symmetry, transitivity, and logical complexity (e.g., the number of reasoning "hops"). We show that LLMs achieve better than random accuracy, but are still far from perfect, even on relatively simple reasoning tasks involving binary relations. We analyse the learned representations and show that LLMs encode useful information directly, arranging the embeddings according to the task.

Out-of-Context Reasoning in Large Language Models
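
The relation properties named in the abstract above (reflexivity, symmetry, transitivity) can be checked mechanically on small finite sets. The following is a minimal illustrative sketch, not the paper's experimental setup: the helper names, the three-element domain, and the two-element base set for inclusion are all assumptions chosen for the example.

```python
from itertools import combinations, product


def is_reflexive(domain, rel):
    """Every element of the domain is related to itself."""
    return all((x, x) in rel for x in domain)


def is_symmetric(rel):
    """Whenever x is related to y, y is related to x."""
    return all((y, x) in rel for (x, y) in rel)


def is_transitive(rel):
    """Whenever x~y and y~z hold, x~z holds as well."""
    return all(
        (x, w) in rel
        for (x, y) in rel
        for (z, w) in rel
        if y == z
    )


# Hypothetical toy domain for equality and strict inequality.
numbers = {0, 1, 2}
equality = {(x, x) for x in numbers}
less_than = {(x, y) for x, y in product(numbers, repeat=2) if x < y}

# Inclusion (subset) over the power set of a hypothetical two-element set.
base = ("a", "b")
subsets = {frozenset(c) for r in range(len(base) + 1) for c in combinations(base, r)}
inclusion = {(s, t) for s, t in product(subsets, repeat=2) if s <= t}

properties = {
    "equality": (is_reflexive(numbers, equality), is_symmetric(equality), is_transitive(equality)),
    "less_than": (is_reflexive(numbers, less_than), is_symmetric(less_than), is_transitive(less_than)),
    "inclusion": (is_reflexive(subsets, inclusion), is_symmetric(inclusion), is_transitive(inclusion)),
}
```

Each tuple in `properties` is ordered as (reflexive, symmetric, transitive). Representing a relation as a set of pairs keeps every check a direct transcription of its textbook definition.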

Recent advances in large language model (LLM) fine-tuning have shown that incorporating high-quality reasoning traces into training data can markedly improve downstream performance. However, existing approaches often depend on expensive manual annotations or auxiliary models, and fail to adapt to the unique limitations of smaller “weak” LLMs. To address these gaps, we introduce Weak2Wise, a fully automated, lightweight framework for synthesizing high-quality, weak-LLM-friendly reasoning traces. Starting from a QA dataset, Weak2Wise filters out the samples that can already be correctly answered by the weak LLM, gathers diverse candidate reasoning traces from multiple strong LLMs, and leverages our Step-Mask scoring to rank and truncate the most guidance-effective traces. These reasoning traces are then used for fine-tuning, yielding substantial improvements in the weak LLM’s reasoning abilities. The name Weak2Wise has two meanings: using a “weak” LLM to select the “wisest” reasoning traces generated by stronger LLMs, and fine-tuning the same weak LLM on these reasoning traces to become “wiser”. We further use Weak2Wise to build GR-1K, a 1,000-sample math and science QA-reasoning dataset optimized for weak LLMs, and fine-tune Qwen2.5-7B on it to create GR-7B, which achieves superior performance on AIME2024, MATH-500, and GPQA Diamond benchmarks.
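
The first stage described in the abstract above, dropping QA samples the weak LLM already answers correctly, might be sketched as follows. This is an illustration under stated assumptions, not the Weak2Wise implementation: `filter_unsolved`, the `question`/`answer` field names, the exact-match comparison, and the toy lookup-table "model" are all hypothetical, and the Step-Mask scoring stage is not shown because the abstract does not specify it.

```python
from typing import Callable

Sample = dict[str, str]


def normalize(text: str) -> str:
    """Hypothetical answer normalization: trim whitespace, ignore case."""
    return text.strip().lower()


def filter_unsolved(samples: list[Sample], weak_answer: Callable[[str], str]) -> list[Sample]:
    """Keep only samples whose reference answer the weak model fails to reproduce."""
    return [
        s for s in samples
        if normalize(weak_answer(s["question"])) != normalize(s["answer"])
    ]


# Toy stand-in for a weak LLM (assumption): a fixed lookup table of answers.
toy_weak_model = {"2 + 2 = ?": "4", "Derivative of x^2?": "x"}


def toy_weak_answer(question: str) -> str:
    return toy_weak_model.get(question, "")


qa = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Derivative of x^2?", "answer": "2x"},
]
hard = filter_unsolved(qa, toy_weak_answer)
```

Here `hard` retains only the second sample, since the toy model answers the first one correctly. In practice the exact-match comparison would likely be replaced by a task-specific answer checker.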
