China

We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall performance. Notably, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work ReasonIR, highlighting the quality of our synthesized training data. Our code, data, and retrieval models are publicly available.

EMNLP 2025

RaDeR: Reasoning-aware Dense Retrieval Models

retrieval-augmented search

reasoning-based retriever

math reasoning

mcts

information retrieval

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. While promising, direct application of ZO methods on edge devices is inefficient due to the high computational cost of multiple forward passes required for accurate gradient estimation, and their deployment has been largely unexplored in practice. We introduce MobiZO, a resource-efficient fine-tuning framework for LLMs specifically designed for edge devices. MobiZO combines three key innovations: (1) a parallelized randomized gradient estimator that employs both outer-loop and inner-loop parallelism to eliminate sequential forward passes, (2) a specialized Multi-Perturbed LoRA (MP-LoRA) module that enables efficient realization of both inner and outer loop parallelism, and (3) a seamless integration with ExecuTorch for on-device training, requiring no modifications to the runtime. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications. Code available at: \url{anonymous.4open.science/r/MobiZO-DBC6}.

MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines

Selecting an optimal set of exemplars is critical for good performance of in-context learning. However, prior exemplar search methods narrowly optimize for predictive accuracy, critically neglecting model calibration—a key determinant of trustworthiness and safe deployment. In this paper, we formulate exemplar selection as a multi-objective optimization problem, explicitly targeting both the maximization of predictive accuracy and the minimization of expected calibration error. We solve this problem with a sample-efficient Combinatorial Bayesian Optimization algorithm (COM-BOM) to find the Pareto-front that optimally trade-offs the two objectives of accuracy and calibration. We evaluate COM-BOM on multiple tasks from un-saturated MMLU-pro benchmark and find that COM-BOM beats or matches the baselines in jointly optimizing the two objectives, while requiring a minimal number of LLM API calls.

COM-BOM: Bayesian Exemplar Search for Efficiently Exploring the Accuracy-Calibration Pareto Frontier

The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multi-modal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multi-modal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pre-training data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding.

What are Foundation Models Cooking in the Post-Soviet World?

Large Language Models (LLMs) are increasingly involved in high-stakes domains, yet how they reason about socially-sensitive decisions still remain underexplored. We present a large-scale audit of LLMs’ treatment of socioeconomic status (SES) in college admissions decisions using a novel dual-process framework inspired by cognitive science. Leveraging a synthetic dataset of 30,000 applicant profiles grounded in real-world correlations, we prompt 4 open-source LLMs (Qwen 2, Mistral v0.3, Gemma 2, Llama 3.1) under 2 modes: a fast, decision-only setup (System 1) and a slower, explanation-based setup (System 2). Results from 5 million prompts reveals that LLMs consistently favor low-SES applicants—even when controlling for academic performance—and that System 2 amplifies this tendency by explicitly invoking SES as compensatory justification, highlighting both their potential and volatility as decision-makers. We then propose DPAF, a dual-process audit framework to probe LLMs’ reasoning behaviors in sensitive applications.

‘Rich Dad, Poor Lad’: How do Large Language Models Contextualize Socioeconomic Factors in College Admission ?

Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the ``Ranking Blind Spot''—a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: **Decision Objective Hijacking**, which alters the evaluation goal in pairwise ranking systems, and **Decision Criteria Hijacking**, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to real-world examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks.

The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking

Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer *that* in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and *that*-mentioning. However, we found that previous measures of information density based on matrix verbs' subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.

Uniform Information Density and Syntactic Reduction: Revisiting *that*-Mentioning in English Complement Clauses

Accurately measuring gender stereotypical bias in language models is a complex task with many hidden aspects. Current benchmarks have underestimated this multifaceted challenge and failed to capture the full extent of the problem. This paper examines the inconsistencies between intrinsic stereotype benchmarks. We propose that currently available benchmarks may each capture different aspects of gender stereotypes rather than providing truly comprehensive measurements. Using StereoSet and CrowS-Pairs as case studies, we investigated how data distribution affects benchmark results. By applying a framework from social psychology to balance the data of these benchmarks across various components of gender stereotypes, we demonstrated that even simple balancing techniques can significantly improve the correlation between different measurement approaches. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets

The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models’ inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding.

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

Recent work has identified retrieval heads (Wu et al., 2025), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHead (Query-Focused Retrieval Head), an improved set of attention heads that significantly enhance retrieval from long contexts. We identify QRHead by aggregating attention scores with respect to the input query, using real-world tasks such as long-context QA. We further introduce QRRetriever, an efficient and effective retriever that uses the accumulated attention mass of QRHead as retrieval scores. We evaluate QRRetriever as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. We also use QRRetriever for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On long-context, multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. Further analysis shows that both the query-context attention scoring and task difficulty are crucial for identifying QRHead with strong downstream utility. Overall, our work contributes a general purpose retriever and offers interpretability insights into the long-context capabilities of LMs.

Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Aligned instruction following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that there is a length bias in evaluation of such models, and that training algorithms tend to exploit this bias by learning longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length instructed evaluations, outperforming standard instruction following models such as GPT4, Llama 3 and Mixtral.

Downloads

Next from EMNLP 2025

MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES