China

Process-supervised reward models serve as fine-grained functions that provide detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results reveal that neither ORM nor PRM consistently outperforms the other across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating how challenging the benchmark is for current VLLMs. Lastly, we showcase a preliminary but promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data points using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations.
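The difference between the two reward-model types in the abstract can be illustrated with a minimal best-of-N selection sketch. Everything here is an assumption made for illustration: the function names, the toy scoring rules, and the choice to aggregate step scores by their mean are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

Steps = Sequence[str]  # one candidate response, split into reasoning steps


def orm_score(steps: Steps, score_output: Callable[[str], float]) -> float:
    """Output reward: judge only the final step (the answer)."""
    return score_output(steps[-1])


def prm_score(steps: Steps, score_step: Callable[[str], float]) -> float:
    """Process reward: judge every step, then aggregate.

    Mean aggregation is an assumption; min or product are also common.
    """
    scores = [score_step(s) for s in steps]
    return sum(scores) / len(scores)


def best_of_n(candidates: Sequence[Steps], scorer: Callable[[Steps], float]) -> int:
    """Return the index of the highest-scoring candidate (first one wins ties)."""
    return max(range(len(candidates)), key=lambda i: scorer(candidates[i]))


# Hypothetical toy judges standing in for a VLLM used as a reward model.
def toy_output_judge(final_step: str) -> float:
    return 1.0 if "12" in final_step else 0.0


def toy_step_judge(step: str) -> float:
    return 0.0 if ("misread" in step or "guess" in step) else 1.0


candidates = [
    ["misread the chart", "guess a total", "answer: 12"],
    ["read the chart", "add the two values", "answer: 12"],
]

orm_pick = best_of_n(candidates, lambda c: orm_score(c, toy_output_judge))
prm_pick = best_of_n(candidates, lambda c: prm_score(c, toy_step_judge))
```

In this toy case both candidates end in the same answer, so the output-level score cannot separate them, while the step-level score prefers the candidate whose intermediate reasoning is sound.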

EMNLP 2025

ViLBench: A Suite for Vision-Language Process Reward Modeling

process reward model

vision-language

benchmark


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)


Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.
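To make the quantities named in this abstract more concrete, the sketch below computes an entropy gap and a divergence between a parametric (no-context) and a contextual next-token distribution, and uses the divergence to weight a generic contrastive adjustment of the logits. This is not CoCoA's decoding rule, which the abstract does not specify: the mixing formula is the standard context-aware contrastive form, the Jensen-Shannon divergence stands in for the "generalized divergence", and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p: List[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in nats; bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: List[float], b: List[float]) -> float:
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def adaptive_contrastive_logits(
    ctx_logits: List[float], param_logits: List[float]
) -> Tuple[List[float], float, float]:
    """Return (adjusted logits, conflict weight alpha, entropy gap).

    alpha is the JS divergence scaled to [0, 1]: near 0 when the two
    distributions agree (low conflict), larger when they disagree.
    """
    p_ctx, p_par = softmax(ctx_logits), softmax(param_logits)
    alpha = js_divergence(p_ctx, p_par) / math.log(2)
    mixed = [(1 + alpha) * c - alpha * p for c, p in zip(ctx_logits, param_logits)]
    entropy_gap = entropy(p_par) - entropy(p_ctx)
    return mixed, alpha, entropy_gap
```

With identical inputs the weight is zero and the contextual logits pass through unchanged, which is the low-conflict behaviour the abstract emphasises; when the two distributions disagree, the contextual token is boosted relative to the parametric one.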

CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models

Medical visual question answering (VQA) and federated learning (FL) have emerged as vital approaches for enabling privacy-preserving, collaborative learning across clinical institutions. However, both these approaches face significant challenges in cross-modal FL scenarios, where each client possesses unpaired images from only one modality. To address this limitation, we propose X-FLoRA, a cross-modal FL framework that uses modality-expert low-rank adaptation (LoRA) for medical VQA. Specifically, X-FLoRA enables the synthesis of images from one modality to another without requiring data sharing between clients. This is achieved by training a backward translation model within a federated asymmetric translation scheme that integrates clinical semantics from textual data. Additionally, X-FLoRA introduces modality-expert LoRA, which fine-tunes separate LoRA modules to strengthen modality-specific representations in the VQA task. The server aggregates the trained backward translation models and fine-tuned LoRA modules using discriminator quality scores and expert-aware weighting, which regulate the relative contributions from different clients. Experiments were conducted on VQA datasets encompassing different medical modalities, and the results demonstrate that X-FLoRA outperforms existing FL methods in terms of VQA performance.

X-FLoRA: Cross-modal Federated Learning with Modality-expert LoRA for Medical VQA

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce **ViDoSeek**, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose **ViDoRAG**, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code will be available.

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Long-range tasks demand reasoning over long inputs. However, existing solutions are limited, e.g., long-context models require large compute budgets, parameter-efficient fine-tuning (PEFT) needs training data, and retrieval-augmented generation (RAG) entails complex task-specific designs. Though in-context approaches overcome many of these issues, methods with short-context LLMs are inefficient, trading context for processing more tokens. We introduce **PRISM**, a highly token-efficient in-context method based on structured schemas that outperforms baselines on diverse tasks with **4x shorter contexts**. This approach produces concise outputs and efficiently leverages key-value (KV) caches to **reduce costs by up to 54%**. PRISM scales down to tiny contexts without increasing costs or sacrificing quality, and generalizes to new tasks with minimal effort by generating schemas from task descriptions.

PRISM: Efficient Long-Range Reasoning With Short-Context LLMs

Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
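For readers unfamiliar with the two calibration metrics reported above, the following sketch computes the Brier score and Expected Calibration Error (ECE) from per-answer confidences and correctness labels. The equal-width binning with 10 bins is a common default and an assumption here; the abstract does not state the binning used in the paper.

```python
from typing import List


def brier_score(confidences: List[float], correct: List[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: List[float], correct: List[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

Lower is better for both metrics: a perfectly calibrated model that is 60% confident should be right about 60% of the time in that bin.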

Calibrating LLM Confidence by Probing Perturbed Representation Stability

Unlearning evaluation has traditionally followed the retrieval paradigm, where adversaries attempt to extract residual knowledge of an unlearning target by issuing queries to a language model. However, the absence of retrievable knowledge does not necessarily prevent an adversary from inferring which targets have been intentionally unlearned in the post-training optimization. Such inferences can still pose significant privacy risks, as they may reveal the sensitive data in the model's training set and the internal policies of model creators. To quantify such privacy risks, we propose a new evaluation framework, **Forensic Unlearning Membership Attacks (FUMA)**, drawing on principles from membership inference attacks. FUMA assesses whether unlearning leaves behind detectable artifacts that can be exploited to infer membership in the forget set. Specifically, we evaluate four major optimization-based unlearning methods on 258 models across diverse unlearned settings and show that examples in the forget set can be identified with up to 99% accuracy. This highlights privacy risks not covered in existing retrieval-based benchmarks. We conclude by discussing recommendations to mitigate these vulnerabilities.

Identifying Unlearned Data in LLMs via Membership Inference Attacks

As the deployment of AI models shifts towards edge devices, developing efficient sequence models has become critical. State-space models (SSMs), particularly Mamba, have emerged as strong rivals to Transformers due to their linear-time complexity and impressive performance across a range of tasks. However, their large parameter counts still hinder their use in resource-constrained environments. To address this, we propose a novel unstructured pruning framework specifically tailored for Mamba, achieving up to 70% parameter reduction with only a 3–9% drop in performance. Unlike pruning techniques designed for Transformers, our approach leverages Mamba's unique recurrent dynamics by incorporating pruning based on both weight and gradient importance to preserve critical parameters, a gradual pruning schedule to maintain model stability, and a global strategy to optimize parameter allocation across the model. Extensive experiments on the WikiText-103, Long Range Arena, and ETT benchmarks demonstrate significant efficiency gains, including 1.77× faster inference and a 46% reduction in memory usage. Our component analysis confirms Mamba's robustness to pruning, highlighting the framework's potential for enabling practical deployment while underscoring the need for careful evaluation to avoid introducing biases in sensitive applications.
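The three ingredients listed in this abstract (importance from both weights and gradients, a gradual schedule, and a global budget) can be sketched as follows. The specific choices are assumptions for illustration: importance is taken as |weight × gradient|, the schedule is a cubic ramp, and parameters are plain Python lists rather than Mamba tensors. The paper's exact criterion and schedule are not given in the abstract.

```python
from typing import Dict, List


def sparsity_at(step: int, total_steps: int, final_sparsity: float) -> float:
    """Cubic ramp from 0 to final_sparsity (an assumed gradual schedule)."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)


def global_prune_mask(
    weights: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    sparsity: float,
) -> Dict[str, List[int]]:
    """Rank all parameters together and zero out the least important fraction.

    Returns a 0/1 mask per tensor; 0 means pruned. Ranking globally lets
    the budget fall unevenly across tensors instead of per-layer quotas.
    """
    scored = [
        (abs(w * g), name, i)
        for name in weights
        for i, (w, g) in enumerate(zip(weights[name], grads[name]))
    ]
    n_prune = int(len(scored) * sparsity)
    pruned = {(name, i) for _, name, i in sorted(scored)[:n_prune]}
    return {
        name: [0 if (name, i) in pruned else 1 for i in range(len(values))]
        for name, values in weights.items()
    }


# Hypothetical two-tensor model.
weights = {"a": [0.1, -2.0], "b": [0.5, 0.05]}
grads = {"a": [1.0, 1.0], "b": [0.1, 0.4]}
mask = global_prune_mask(weights, grads, sparsity=0.5)
```

In this toy example the two lowest-scoring parameters both sit in tensor `b`, so the global ranking prunes `b` entirely and leaves `a` untouched, which a per-tensor quota would not do.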

Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments

Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.
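The idea of flattening a predictive distribution over multiple-choice options can be expressed as a KL-divergence penalty against the uniform distribution. The sketch below is a plain-Python illustration under assumptions: the direction of the divergence, KL(p ‖ uniform), and the function names are hypothetical, since the abstract does not specify the exact loss used by DF-MCQ.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_to_uniform(option_logits: List[float]) -> float:
    """KL(p || uniform) over answer options, in nats.

    Zero when the model is equally unsure about every option, and at most
    log(k) when all probability mass sits on one of k options.
    """
    p = softmax(option_logits)
    k = len(p)
    return sum(pi * math.log(pi * k) for pi in p if pi > 0)
```

Minimising this quantity on questions about the unlearning target pushes the model toward random-choice-level uncertainty, the behaviour the abstract reports measuring on probing questions.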

Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterparts. While the model gains instruction following ability, instruction tuning may lead to forgetting the knowledge from pre-training or it may encourage the model to be overly conversational or verbose. This, in turn, can lead to degradation of in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction-tuning via the partial adaptation method. We show that, across several model families and model sizes, reducing the strength of instruction-tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction following ability as measured by AlpacaEval. Our study sheds light on the potential trade-off between in-context learning and instruction following abilities that is worth considering in practice.

Improving Instruct Models for Free: A Study on Partial Adaptation

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
