Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as the two primary paradigms for large language models (LLMs) to solve mathematical problems. Current research typically pursues unidirectional enhancement: P-CoT enhancing N-CoT, or N-CoT enhancing P-CoT. In this paper, we seek to fully unleash the strengths of both paradigms for mutual enhancement and ultimately achieve simultaneous improvement. We conduct a detailed analysis of the error types of the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) three target-designed subtasks that integrate sequential P-CoT and N-CoT generation; 2) a subtask hybrid training strategy that facilitates natural-language semantic transferability; and 3) an auxiliary reward derived from the converted N-CoT to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly improves the performance of both N-CoT and P-CoT, especially N-CoT. With Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA even achieves gains of +21.87 and +21.48 on MathQA over the resource-intensive RL baseline.
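To make the auxiliary-reward idea concrete, here is a minimal sketch of how a sparse P-CoT execution reward could be blended with a dense signal from a converted N-CoT rationale. All function names and the token-overlap similarity are illustrative assumptions, not Parrot's actual reward design.

```python
# Hypothetical sketch: the paper's converted-N-CoT auxiliary reward, approximated
# here with a simple token-overlap similarity. Nothing below is Parrot's API.

def pcot_reward(program_output: str, gold_answer: str) -> float:
    """Sparse reward: 1.0 only when the executed program's output matches the answer."""
    return 1.0 if program_output.strip() == gold_answer.strip() else 0.0

def ncot_auxiliary_reward(converted_ncot: str, reference_ncot: str) -> float:
    """Dense auxiliary signal: token overlap between the N-CoT converted from the
    program and a reference rationale (a stand-in for the paper's design)."""
    pred, ref = set(converted_ncot.split()), set(reference_ncot.split())
    if not pred or not ref:
        return 0.0
    return len(pred & ref) / max(len(pred), len(ref))

def combined_reward(program_output: str, gold_answer: str,
                    converted_ncot: str, reference_ncot: str,
                    aux_weight: float = 0.5) -> float:
    """Blend the sparse execution reward with the dense auxiliary reward, so the
    policy still receives learning signal when the program's answer is wrong."""
    return (pcot_reward(program_output, gold_answer)
            + aux_weight * ncot_auxiliary_reward(converted_ncot, reference_ncot))

if __name__ == "__main__":
    # Wrong final answer, but a partially correct rationale still earns reward.
    print(combined_reward("41", "42",
                          "add 20 and 22 to get 42",
                          "adding 20 and 22 gives 42"))
```

Under this framing, a rollout whose program fails to execute or returns a wrong answer can still receive graded feedback from its converted rationale, which is one plausible way the dense auxiliary term mitigates reward sparsity.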