China

Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements. An effective multimodal retriever needs to handle two main challenges: (1) mitigating the effect of irrelevant content caused by fixed, single-granularity retrieval units, and (2) supporting multihop reasoning by effectively capturing semantic relationships among components within and across documents. To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations. First, we introduce a layered component graph that explicitly represents multimodal information at two layers, one coarse-grained and one fine-grained, facilitating efficient yet precise reasoning. Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that first identifies coarse-grained nodes for efficient candidate generation and then performs fine-grained reasoning via late interaction. Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on four out of five benchmarks, notably without additional fine-tuning.
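The second stage described in the abstract, fine-grained scoring via late interaction, can be illustrated with a generic ColBERT-style MaxSim score. This is only a sketch under stated assumptions: the abstract does not give LILaC's scoring function or its edge-based traversal, and the names `late_interaction_score`, `rerank`, and the toy embeddings are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], component_vecs: List[Vector]) -> float:
    """Generic MaxSim: match each query vector to its best component vector, then sum.

    Assumes component_vecs is non-empty (max() raises ValueError otherwise).
    """
    return sum(max(dot(q, c) for c in component_vecs) for q in query_vecs)


def rerank(query_vecs: List[Vector], candidates: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Score every candidate kept by the coarse stage and sort best-first."""
    scored = [(name, late_interaction_score(query_vecs, vecs)) for name, vecs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "caption": [[0.5, 0.5]],                  # one fine-grained vector
    "table_row": [[1.0, 0.0], [0.0, 1.0]],    # two fine-grained vectors
}
ranking = rerank(query, candidates)
```

Here `table_row` ranks first because each query vector finds an exact match among its fine-grained vectors, while `caption` only partially matches both.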

EMNLP 2025

LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval

multimodal

retrieval

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Event Detection (ED), the task of identifying event mentions in natural language text, is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics to mitigate domain drift. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions, ensuring high annotation quality. Experiments on three ED datasets from diverse domains show that SNaRe outperforms the best baseline, achieving average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1 improvement for multilingual generation. Analysis of the generated trigger hit rate, together with human evaluation, substantiates SNaRe's stronger annotation quality and reduced domain drift.

SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.
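To make "verifiable reward" and GRPO's group-relative scoring concrete, the following sketch pairs an exact-match reward for short-form QA with a standardized advantage computed within one group of sampled answers. It is an illustration under assumptions, not the paper's reward design: the normalization rule, the use of population standard deviation, and the function names are all hypothetical choices.

```python
import re
import statistics
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def exact_match_reward(prediction: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the normalized answers match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one group of samples for the same question.

    Uses the population standard deviation (an assumption; implementations vary).
    Returns all zeros when every sample received the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Four hypothetical sampled answers to one table question whose reference is "42".
samples = ["42", " 42 ", "41", "forty-two"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average receive positive advantages and the rest negative ones, so no separately learned value model is needed to rank them.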

Table-R1: Inference-Time Scaling for Table Reasoning Tasks

We present LimRank, a reranking model that excels in reasoning-intensive retrieval tasks, fine-tuned with only 20K examples—less than 5% of the data typically used in prior work. Unlike existing approaches that rely on large-scale fine-tuning or pretraining for LLM-based reranking, we show that modern LLMs can be effectively adapted with minimal, high-quality supervision. To enable this, we design LimRank-Synthesizer, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. We evaluate LimRank on two challenging information retrieval benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and Follow-IR for instruction-following retrieval. The experimental results demonstrate that LimRank achieves state-of-the-art performance among all 7B-level rerankers. Additional experiments on downstream tasks, including scientific literature search and retrieval-augmented generation, further establish LimRank as a practical and strong plug-and-play reranking model for real-world IR systems.

LimRank: Less is More for Reasoning-Intensive Information Reranking

Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches develop separate instructions for different KGC tasks, leading to duplicated work and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task, with improvements ranging from 8.7% to 29.8%. Our source code is available at https://anonymous.4open.science/r/KGC-SAT.

Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning

Large language models (LLMs) have shown strong capabilities across diverse domains, and their application to code vulnerability detection holds great potential for identifying security flaws and improving software safety. In this paper, we propose a sequential multi-stage approach via confidence- and collaboration-based decision making (ConfColl). The system adopts a three-stage sequential classification framework, proceeding through a single agent, retrieval-augmented generation (RAG) with external examples, and multi-agent reasoning enhanced with RAG. The decision process selects among these strategies to balance performance and cost, terminating at any stage where a high-certainty prediction is achieved. Experiments on a benchmark dataset and a low-resource language demonstrate the effectiveness of our framework in enhancing code vulnerability detection performance.
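The early-exit control flow described above can be sketched as a simple cascade. This is an illustrative outline only: the three stage functions are stubs standing in for the single agent, the RAG stage, and the multi-agent stage, and the 0.9 confidence threshold and all names are assumptions rather than values from the paper.

```python
from typing import Callable, List, Tuple

# A stage maps source code to (label, confidence in [0, 1]).
Stage = Callable[[str], Tuple[str, float]]


def cascade_classify(code: str, stages: List[Stage], threshold: float = 0.9) -> Tuple[str, float, int]:
    """Run stages in order and stop at the first sufficiently confident prediction.

    Returns (label, confidence, index of the stage that produced the answer).
    If no stage reaches the threshold, the last stage's prediction is returned.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    label, confidence = "", 0.0
    for index, stage in enumerate(stages):
        label, confidence = stage(code)
        if confidence >= threshold:
            return label, confidence, index
    return label, confidence, len(stages) - 1


# Hypothetical stubs; real stages would call LLMs with increasing cost.
def single_agent(code: str) -> Tuple[str, float]:
    return ("vulnerable", 0.6)


def rag_agent(code: str) -> Tuple[str, float]:
    return ("vulnerable", 0.95)


def multi_agent_rag(code: str) -> Tuple[str, float]:
    return ("safe", 0.99)


pipeline = [single_agent, rag_agent, multi_agent_rag]
result = cascade_classify("strcpy(buf, user_input);", pipeline)
```

With these stubs the cascade stops at the second stage, so the most expensive multi-agent stage is never invoked.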

A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making

Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale LLMs like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce Reinforced Query Reasoner (RQR), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. Our approach frames query reformulation as a reinforcement learning problem and employs a novel semi-rule-based reward function. This enables smaller language models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve reasoning performance rivaling large-scale LLMs without their prohibitive inference costs. Experimental results on the BRIGHT benchmark show that, with BM25 as the retriever, both the RQR-7B and RQR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some of the latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment. All code and datasets will be publicly released.

Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks

Large language models (LLMs) often fail to capture semantic changes in queries due to negation, and generate incorrect responses. Negation is common in real-world language and is useful for understanding the opposite or absence of a statement, so it is an essential element in logical reasoning. Previous studies have explored LLMs' ability to capture negations separately from their ability to properly ground knowledge for positive queries. However, this perspective is limited in that it cannot clearly distinguish whether the cause of incorrect responses is the logical incoherence caused by negations or the lack of grounding ability for the given context. To address this issue, we focus on the phenomenon of the model failing to capture semantic contradictions in negated queries despite its accurate understanding of knowledge about positive queries. We term this phenomenon negation blindness on the query. We propose a verification framework that includes task design and measurement methods to verify this issue. In detail, we establish two criteria for systematic task design, (i) complexity and (ii) constrainedness, and devise four verification tasks accordingly. Moreover, we analyze the results extensively and provide insights into the feasibility of alleviating the problem through experiments with various approaches.

Semantic Inversion, Identical Replies: Revisiting Negation Blindness in Large Language Models

Addressing the challenges in QA for specific technical domains requires identifying relevant portions of extensive documents and generating answers based on this focused content. Traditional pre-trained LLMs often struggle with domain-specific terminology, while fine-tuned LLMs demand substantial computational resources. To overcome these limitations, we propose TIDES, Technical Information Distillation and Extraction System. TIDES is a training-free approach that combines traditional TF-IDF techniques with prompt-based LLMs in a hybrid process, effectively addressing complex technical questions. It uses TF-IDF to identify and prioritize domain-specific terms that are less common in other documents and LLMs to refine the candidate pool by focusing on the most relevant segments in documents through multiple stages. Our approach improves the precision and efficiency of QA systems in technical contexts without LLM retraining.
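The TF-IDF step, which favors terms that are frequent in one document but rare in others, can be illustrated with the textbook formulation below. This is a minimal sketch, not TIDES itself: the abstract does not specify the exact weighting or the LLM refinement stages, and `tf_idf`, `top_terms`, and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: (term count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Highest-scoring terms first; ties are broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, made up for illustration only.
docs = [
    "the torque sensor calibrates the torque".split(),
    "the manual describes the sensor".split(),
    "the index lists the chapters".split(),
]
scores = tf_idf(docs)
key_terms = top_terms(scores[0], k=2)
```

In the first toy document, "the" scores zero because it appears in every document, while "torque" ranks highest because it is frequent locally and absent elsewhere. Terms selected this way could then be used to narrow the segments passed to an LLM.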

TIDES: Technical Information Discovery and Extraction System

Equipped with the capability to call functions, modern LLM agents can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLM agents but also on precise user instructions, which often cannot be ensured in the real world. To evaluate LLM agents' tool use under imperfect instructions, we meticulously examine real-world instructions from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We find that due to the next-token prediction training objective, LLM agents tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed, which prompts LLM agents to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLM agents' performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that Ask-when-Needed significantly outperforms existing frameworks for tool learning on Noisy ToolBench. We will release all related code and datasets to support future research.
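The behavior the framework aims for, asking instead of inventing a missing argument, can be illustrated with a rule-based stand-in. Note the swap: Ask-when-Needed works by prompting the LLM agent, whereas this sketch simply checks required arguments against what the user supplied. The function `plan_tool_call`, the tool name, and the argument names are hypothetical.

```python
from typing import Dict, List


def plan_tool_call(tool_name: str, required: List[str], provided: Dict[str, str]) -> Dict[str, object]:
    """Return a tool call if every required argument is present, else a question.

    Missing or empty arguments are never filled in with guesses.
    """
    missing = [name for name in required if not provided.get(name)]
    if missing:
        question = f"To call {tool_name}, please provide: {', '.join(missing)}."
        return {"action": "ask_user", "question": question}
    arguments = {name: provided[name] for name in required}
    return {"action": "call_tool", "tool": tool_name, "arguments": arguments}


# The user instruction omitted the destination.
required_args = ["origin", "destination", "date"]
incomplete = plan_tool_call("book_flight", required_args, {"origin": "SFO", "date": "2025-11-05"})

# After the user answers, the call can proceed.
complete = plan_tool_call(
    "book_flight",
    required_args,
    {"origin": "SFO", "destination": "SZV", "date": "2025-11-05"},
)
```

The first plan returns a clarifying question rather than a tool call, which avoids the hallucinated-argument failure mode described in the abstract.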

Learning to Ask: When LLM Agents Meet Unclear Instruction

Efficient multi-hop reasoning requires agents based on large language models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because of the sparse rewards that come from a global signal only. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It uses richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models, respectively, over various search-with-RL baselines using only 19k training samples, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs.
