
Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact {3, 8}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems.
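The Stage 1 objective described above combines three terms: cross-entropy, RankNet, and KL divergence against the teacher. The following is a minimal sketch of how such a hybrid loss could be assembled for one query's candidate list. The abstract does not give the exact formulation, so the binary labels, the softmax over candidate scores, the equal weights, and all function names here are illustrative assumptions rather than the authors' implementation.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, labels):
    """Mean binary cross-entropy of sigmoid(score) against 0/1 labels (assumed label format)."""
    return sum(softplus(-s) if y == 1 else softplus(s)
               for s, y in zip(scores, labels)) / len(scores)

def ranknet(scores, labels):
    """Mean pairwise logistic loss over pairs where document i is more relevant than j."""
    n = len(scores)
    pairs = [(i, j) for i in range(n) for j in range(n) if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    return sum(softplus(scores[j] - scores[i]) for i, j in pairs) / len(pairs)

def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) between softmax distributions over the candidates."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def hybrid_loss(student_scores, teacher_scores, labels,
                w_ce=1.0, w_rank=1.0, w_kl=1.0):
    """Weighted sum of the three terms; the weights are hypothetical hyperparameters."""
    return (w_ce * cross_entropy(student_scores, labels)
            + w_rank * ranknet(student_scores, labels)
            + w_kl * kl_divergence(teacher_scores, student_scores))

if __name__ == "__main__":
    labels = [1, 0, 0]
    teacher = [2.0, 0.0, -1.0]
    print(hybrid_loss([1.5, 0.2, -0.8], teacher, labels))
```

A student whose scores order the documents like the teacher and the labels should receive a lower loss than one that reverses the order, which is the behavior each of the three terms rewards.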

EMNLP 2025

DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation

reasoning model

reranking

retrieval

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

State-of-the-art neural machine translation (NMT) models deliver high-quality translations at the expense of large inference latency and energy consumption, requiring vast GPU fleets and contributing significantly to carbon emissions. To democratize and "green" NMT, we introduce the Green KNIGHT, a hardware-agnostic collection of recipes to optimize model performance in terms of speed and energy consumption, with only a minor trade-off in quality. On two high-resource benchmarks we show up to 91× CPU speedup and 94% energy savings for En→De, and 65× speedup and 10% energy usage for En→Ko, while incurring only minor losses of 9% relative BLEU. Our results prove that efficient and environmentally conscious NMT can be realized through optimizations built on well-understood, off-the-shelf techniques with no custom low-level code required, making our approach immediately deployable in real-world translation pipelines.

The Green KNIGHT: Green Machine Translation with Knowledge-Distilled, Narrow, Inexpensive, Greedy, Hybrid Transformers

Autoregressive Transformer (AT) dominates sequence-to-sequence generation tasks but suffers from high inference latency due to sequential token generation. Non-Autoregressive Transformer (NAT) improves inference efficiency by parallelizing token prediction, yet degrades generation quality. To address these limitations, we propose Tree-structured Non-Autoregressive Decoding (TNAD), a novel paradigm that bridges autoregressive and non-autoregressive decoding. TNAD generates a sentence through a top-down, layer-wise expansion of its constituency parse tree, enabling parallel generation within each layer while preserving contextual dependencies across layers. Experimental results on machine translation and paraphrase generation demonstrate that TNAD outperforms AT in efficiency and NAT in generation quality, thus offering a new alternative to AT and NAT in the trade-off between efficiency and quality.

Tree-Structured Non-Autoregressive Decoding for Sequence-to-Sequence Text Generation

We introduce Fourier Domain Adapter (FDA), a novel and parameter-efficient framework for fine-tuning large-scale pre-trained language models. FDA reparameterizes the core projection operation of the adapter module directly in the Fourier domain. This involves transforming the input features via discrete Fourier transform (DFT), applying sparse learnable complex modulations in frequency space, and then back-transforming via inverse DFT, supplemented by highly compact auxiliary linear layers. This approach significantly reduces the number of trainable parameters while enhancing the model's ability to capture salient frequency-based semantic information. Comprehensive experiments on GLUE, E2E NLG, and instruction tuning benchmarks show that FDA consistently outperforms existing parameter-efficient fine-tuning (PEFT) methods. It can achieve better performance with nearly 100x fewer trainable parameters than established PEFT methods such as LoRA and AdapterH. Our results demonstrate that FDA is a robust and efficient solution for adapting powerful language models.

Towards More Efficient Post-training via Fourier Domain Adapter Framework
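The FDA abstract describes a three-step projection: DFT, sparse complex modulation in frequency space, and inverse DFT. The sketch below illustrates that pipeline on a single feature vector using only the standard library. It is a toy under stated assumptions, not the paper's module: the residual form `x + delta`, taking the real part of the inverse transform, the dictionary representation of the sparse gains, and the function names are all hypothetical, and the auxiliary linear layers mentioned in the abstract are omitted.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT, including the 1/n normalization."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fourier_adapter(x, modulation):
    """Apply a sparse frequency-domain update to feature vector x.

    modulation maps a frequency index to a complex gain (the trainable
    parameters in this sketch); every other frequency gets gain 0, so an
    empty dict leaves x unchanged (an assumed zero-initialized adapter).
    """
    spectrum = dft(x)
    modulated = [spectrum[k] * modulation.get(k, 0j) for k in range(len(x))]
    delta = idft(modulated)
    # Assumption: keep only the real part so the output stays real-valued.
    return [xi + d.real for xi, d in zip(x, delta)]

if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]
    gains = {0: 1 + 0j}  # a single learnable complex gain on the DC component
    print(fourier_adapter(features, gains))
```

In this toy, the number of trainable real values is twice the number of entries in `modulation`, which is the sense in which sparsity in frequency space keeps the parameter count small.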

We propose EditID, a training-free approach based on the DiT architecture, which achieves highly editable customized IDs for text-to-image generation. Existing text-to-image models for customized IDs typically focus more on ID consistency while neglecting editability. It is challenging to alter facial orientation, character attributes, and other features through prompts. EditID addresses this by deconstructing the text-to-image model for customized IDs into an image generation branch and a character feature branch. The character feature branch is further decoupled into three modules: feature extraction, feature fusion, and feature integration. By introducing a combination of mapping features and shift features, along with controlling the intensity of ID feature integration, EditID achieves semantic compression of local features across network depths, forming an editable feature space. This enables the generation of high-quality images with editable IDs while maintaining ID consistency. EditID achieves excellent results on IBench, an editability evaluation framework for customized-ID text-to-image generation, which quantitatively demonstrates its superior performance. EditID is the first text-to-image solution to propose customizable ID editability on the DiT architecture, meeting the demands of long prompts and high-quality image generation.

EditID: Training-Free Editable ID Customization for Text-to-Image Generation

In Embedding Based Retrieval (EBR), Approximate Nearest Neighbor (ANN) algorithms are widely adopted for efficient large-scale search. However, recent studies reveal a query out-of-distribution (OOD) issue, where query and base embeddings follow mismatched distributions, significantly degrading ANN performance. In this work, we empirically verify the generality of this phenomenon and provide a quantitative analysis. To mitigate the distributional gap, we introduce a distribution regularizer into the encoder training objective, encouraging alignment between query and base embeddings. Extensive experiments across multiple datasets, encoders, and ANN indices show that our method consistently improves retrieval performance.

Alleviating Performance Degradation Caused by Out-of-Distribution Issues in Embedding-Based Retrieval
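The abstract above mentions adding a distribution regularizer to the encoder training objective so that query and base embeddings align, but it does not state the regularizer's form. As one possible illustration, the sketch below uses simple moment matching (difference of per-dimension means and variances) as a stand-in. The choice of moment matching, the `weight` hyperparameter, and all function names are assumptions and should not be read as the paper's method.

```python
def mean_vector(vectors):
    """Per-dimension mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def variance_vector(vectors):
    """Per-dimension population variance of a list of embedding vectors."""
    means = mean_vector(vectors)
    n = len(vectors)
    return [sum((v - m) ** 2 for v in col) / n
            for col, m in zip(zip(*vectors), means)]

def moment_gap(query_embs, base_embs):
    """Squared gap between the first and second moments of the two embedding sets."""
    mean_gap = sum((a - b) ** 2
                   for a, b in zip(mean_vector(query_embs), mean_vector(base_embs)))
    var_gap = sum((a - b) ** 2
                  for a, b in zip(variance_vector(query_embs), variance_vector(base_embs)))
    return mean_gap + var_gap

def regularized_loss(retrieval_loss, query_embs, base_embs, weight=0.1):
    """Retrieval loss plus a weighted distribution-alignment penalty (weight is hypothetical)."""
    return retrieval_loss + weight * moment_gap(query_embs, base_embs)

if __name__ == "__main__":
    base = [[0.0, 0.0], [2.0, 2.0]]
    queries = [[1.0, 1.0], [3.0, 3.0]]  # same spread as base, shifted mean
    print(moment_gap(queries, base))
    print(regularized_loss(0.5, queries, base))
```

The penalty is zero when the two sets share means and variances and grows as the query distribution drifts away from the base distribution, which is the qualitative behavior the abstract attributes to its regularizer.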

Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning. In contrast, real-world scenarios always involve understanding sequences of images. A typical scenario is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition. To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate a model's ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels. It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance; for example, GPT-4o achieves only 23.93% accuracy in the reordering task, 56.07% below human levels. These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.

Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?

Aspect-Opinion Pair Extraction (AOPE) and Aspect Sentiment Triplet Extraction (ASTE) have drawn growing attention in NLP. However, most existing approaches extract aspects and opinions independently, optionally adding pairwise relations, often leading to error propagation and high time complexity. To address these challenges, and inspired by transition-based dependency parsing, we propose the first transition-based model for AOPE and ASTE that performs aspect and opinion extraction jointly, which also better captures position-aware aspect-opinion relations and mitigates entity-level bias. By integrating contrastive-augmented optimization, our model delivers more accurate action predictions and jointly optimizes separate subtasks in linear time. Extensive experiments on four commonly used ASTE/AOPE datasets show that our proposed transition-based model outperforms previous models on two out of the four datasets when trained on a single dataset. When multiple training sets are used, our proposed method achieves new state-of-the-art results on all datasets. We show that this is partly due to our model's ability to benefit from transition actions learned from multiple datasets and domains. Our code is available at https://anonymous.4open.science/r/trans_aste-7079.

Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction

Generative large language models (LLMs) have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or perplexity (PPL), which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperforms the full context in some cases.

QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory

Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general "meta-cognitive" awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM’s language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts—neurons likely responsible for newly acquired visual capabilities—while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.
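The Neuron-Fusion idea described above, keeping neurons with large parameter shifts and attenuating those with small shifts, can be illustrated with a toy merge of two weight matrices. This is a sketch under assumptions: treating each matrix row as one neuron, ranking neurons by the L2 norm of their shift, and the `keep_ratio` and `attenuation` values are all hypothetical choices that the abstract does not specify.

```python
import math

def neuron_fusion(base, tuned, keep_ratio=0.5, attenuation=0.2):
    """Merge tuned weights into base weights, neuron by neuron.

    base, tuned: lists of rows, one row per neuron (assumed layout).
    Neurons whose shift (tuned - base) has the largest L2 norm keep their
    full shift; the remaining neurons have their shift scaled by attenuation.
    """
    deltas = [[t - b for b, t in zip(b_row, t_row)]
              for b_row, t_row in zip(base, tuned)]
    norms = [math.sqrt(sum(d * d for d in row)) for row in deltas]
    n_keep = max(1, math.ceil(keep_ratio * len(norms)))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:n_keep])
    merged = []
    for i, (b_row, d_row) in enumerate(zip(base, deltas)):
        scale = 1.0 if i in keep else attenuation
        merged.append([b + scale * d for b, d in zip(b_row, d_row)])
    return merged

if __name__ == "__main__":
    base_weights = [[1.0, 1.0], [1.0, 1.0]]
    tuned_weights = [[4.0, 5.0], [1.5, 1.0]]  # neuron 0 shifted far more than neuron 1
    print(neuron_fusion(base_weights, tuned_weights))
```

In the example, the first neuron keeps its tuned values while the second is pulled most of the way back toward the base model, mirroring the split between visual adaptation and general language skills that the abstract describes.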
