China

Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high computational and storage efficiency while maintaining robust performance across vocabulary and semantic mismatches.

EMNLP 2025

Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval

free-text/natural language explanations

document representation

passage retrieval

dense retrieval

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Mobile task automation is an emerging technology that leverages AI to automatically execute routine tasks by users' commands on mobile devices like Android, thus enhancing efficiency and productivity. While large language models (LLMs) excel at general mobile tasks through training on massive datasets, they struggle with app-specific workflows. To solve this problem, we designed UI Map, a structured representation of target app's UI information. We further propose a UI Map-guided LLM-based approach UICompass to automate mobile tasks. Specifically, UICompass first leverages static analysis and LLMs to automatically build UI Map from either source codes of apps or byte codes (i.e., APK packages). During task execution, UICompass mines the task-relevant information from UI Map to feed into the LLMs, generate a planned paths, and adaptively adjust the path based on the actual app state and action history. Experimental results demonstrate that UICompass achieves a 15.87% higher task executing success rate than SOTA approaches. Even when only APK is available, UICompass maintains superior performance, demonstrating its applicability to closed-source apps.

UICOMPASS: UI Map Guided Mobile Task Automation via Adaptive Action Generation

Conversational search systems require effective handling of context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming these queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches suffer from two critical constraints: high dependency on costly external supervision from human annotations or large language models, and insufficient alignment between the rewriting model and downstream retrievers. We present ConvSearch-R1, the first self-driven framework that completely eliminates dependency on external rewrite supervision by leveraging reinforcement learning to optimize reformulation directly through retrieval signals. Our novel two-stage approach combines Self-Driven Policy Warm-Up to address the cold-start problem through retrieval-guided self-distillation, followed by Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity issue in conventional retrieval metrics. Extensive experiments on TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving over 10% improvement on the challenging TopiOCQA dataset while using smaller 3B parameter models without any external supervision.

ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

We propose a compositional entity modeling framework for requirement extraction from online job advertisements (OJAs), representing complex, tree-like structures that connect atomic entities via typed relations. Based on this schema, we introduce GOJA, a manually annotated dataset of 500 German job ads that captures roles, tools, experience levels, attitudes, and their functional context. We report strong inter-annotator agreement and benchmark transformer models, demonstrating the feasibility of learning this structure. A focused case study on AI-related requirements illustrates the analytical value of our approach for labor market research.

Improving Online Job Advertisement Analysis via Compositional Entity Extraction

LLMs with in-context learning (ICL) obtain remarkable performance but are sensitive to the quality of ICL examples. Prior works on ICL example selection explored unsupervised heuristic methods and supervised LLM-based methods, but they typically focus on the selection of individual examples and ignore correlations among examples. Researchers use the determinantal point process (DPP) to model negative correlations among examples to select diverse examples. However, the DPP fails to model positive correlations among examples, while ICL still requires the positive correlations of examples to ensure the consistency of examples, which provides a clear instruction for LLMs. In this paper, we propose an ICL example selection method based on the nonsymmetric determinantal point process (NDPP) to capture positive and negative correlations, considering both the diversity and the relevance among ICL examples. Specifically, we optimize NDPP via kernel decomposition-based MLE to fit a constructed pseudo-labeled dataset, where we also propose a low-rank decomposition to reduce the computational cost. Further, we perform query-aware kernel adaptation on our NDPP to customize the input query, and we select examples via a MAP inference based on the adapted NDPP. Experimental results show our model outperforms strong baselines in ICL example selection.

Correlation-Aware Example Selection for In-Context Learning with Nonsymmetric Determinantal Point Processes

Dense Retrieval Models (DRMs) estimate the semantic similarity between queries and documents based on their embeddings. Prior studies highlight the importance of embedding contextualization in enhancing retrieval performance. To this aim, existing approaches primarily leverage token-level information derived from query/document interactions. In this paper, we introduce a novel DRM that leverages query/document interactions based on the full embedding representations generated by a Transformer-based model. To enhance similarity estimation, our model integrates external linguistic information about the Cognitive Complexity of texts, enriching the contextualization of embeddings. We empirically evaluate our approach across seven datasets and three different IR tasks to assess the impact of Cognitive Complexity-aware query and document embeddings for contextualization in dense retrieval. Results show that our approach consistently outperforms standard fine-tuning techniques on lightweight bi-encoders (e.g., BERT-based) and traditional late-interaction models (i.e., ColBERT) across all benchmarks. On larger retrieval-optimized bi-encoders like Contriever, our model achieves comparable or higher performance on four of the considered evaluation benchmarks. Our findings suggest that Cognitive Complexity-aware embeddings enhance query and document representations, improving retrieval effectiveness in DRMs.

Leveraging Cognitive Complexity of Texts for Contextualization in Dense Retrieval

Table question answering (TQA) requires accurate retrieval and reasoning over tabular data. Existing approaches attempt to retrieve relevant content before leveraging large language models (LLMs) for reasoning over long tables. However, these methods often fail to accurately retrieve contextually relevant data, resulting in information loss. They may also suffer from excessive encoding overhead and hallucination issues. In this paper, we propose TALON, a multi-agent framework designed for question answering over long tables. The framework features a planning agent that iteratively invokes a tool agent to access and manipulate tabular data based on intermediate feedback, progressively collecting the information necessary for answer generation, while a critic agent ensures accuracy and efficiency in tool usage and planning. To comprehensively assess the effectiveness of our framework, we introduce two benchmarks derived from the WikiTableQuestion and BIRD-SQL datasets, containing tables ranging from 50 to over 10,000 rows. Experimental results demonstrate that TALON achieves average accuracy improvements of 7.5% and 12.0% across all language models, establishing a new state-of-the-art in long-table question answering. Our code is publicly available at:https://anonymous.4open.science/r/TALON-uouxy.

TALON: A Multi-Agent Framework for Long-Table Exploration and Question Answering

Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through beam search procedure, until the victim classifier changes its decision. We perform (1) quantitative evaluation using various prompts and models, (2) targeted manual assessment of the generated text and (3) qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in case of long input text (news articles), where exhaustive search is not feasible.

Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges. For example, different table fields have varying matching preferences: cells may favor finer-grained (word/phrase level) matching over broader (sentence/passage level) matching due to their fragmented and detailed nature, unlike titles. This necessitates a table-specific retriever to accommodate the various matching needs of each table field. Therefore, we introduce a Table-tailored HYbrid Matching rEtriever (THYME), which approaches table retrieval from a field-aware hybrid matching perspective. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that THYME significantly outperforms state-of-the-art baselines. Comprehensive analyses have confirmed the differing matching preferences across table fields and validated the efficacy of THYME.

Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective

Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (``bring-your-own'') KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. We will open-source our framework as a multi-strategy toolkit for graph linking and retrieval.

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task—30-day readmission prediction from a psychiatric discharge—using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.

Downloads

Next from EMNLP 2025

UICOMPASS: UI Map Guided Mobile Task Automation via Adaptive Action Generation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES