China

Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world commonsense anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. Our results show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.

EMNLP 2025

CAVE : Detecting and Explaining Commonsense Anomalies in Visual Environments

vision question answering

anomaly detection

commonsense reasoning

multimodality

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The evolution of text-to-image (T2I) generation techniques has brought new capability for information visualization, and this advancement could have the potential to boost knowledge democratization and educational equity. In this paper, we envision these technologies as powerful tools to promote accessible healthcare knowledge education, which could not only serve the public but also more beneficial for communities in underserved regions and people with specific disabilities such as reading ability and attention limitations. We first explore how to harness recent T2I models to generate health knowledge flashcards, which are educational aids that aggregate knowledge with visually appealing and concise presentations in an image. Then, we curated a diverse and high quality healthcare knowledge flashcards datasets with 2034 samples from credible knowledge resources. We also validate the effectiveness of fine-tuning open-sourced models with our dataset to serve as a promising health flashcards generator. Our code is available at Anonymous Github: https://anonymous.4open.science/r/HealthCards

HealthCards: Exploring Text-to-Image Generation as Visual Aids for Healthcare Knowledge Democratizing and Education

Recent video generative models primarily rely on detailed, labor-intensive text prompts for tasks, like inpainting or style editing, limiting adaptability for personal/raw videos. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video editing method, supporting diverse video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P), which automatically generates structured video descriptions capturing both scene context and object details, and Paragraph-to-Video (P2V), where users (optionally) refine these descriptions to guide a video diffusion model for flexible content modifications, including removing, changing subjects, and/or adding new objects. Key contributions of RACCooN include: (1) A multi-granular spatiotemporal pooling strategy for structured video understanding, capturing both broad context and fine-grained details of major objects to enable precise text-based video editing without the need for complex human annotations. (2) A video generative model fine-tuned on our curated video-paragraph-mask dataset, enhances the editing and inpainting quality. (3) The capability to seamlessly generate new objects in videos by forecasting their movements through automatically generated mask planning. In the end, users can easily edit complex videos with RACCooN's automatic explanations and guidance. We demonstrate its versatile capabilities in video-to-paragraph generation (up to 9.4%p absolute improvement in human evaluations) and video content editing (relative to 49.7% lower FVD), and can be integrated with SoTA video generation models for further enhancement.

RACCooN: Versatile Instructional Video Editing with Auto-Generated Narratives

Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM's capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.

Do RAG Systems Really Suffer From Positional Bias?

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long chain-of-thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to better utilize computation, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, including MMVU, Video-MMMU, LongVideoBench, and Video-MME. Video-RTS surpasses existing video reasoning models by 1.9% in accuracy using only 1.4% of the training data, demonstrating the efficiency and effectiveness of our framework. Notably, pure RL training and adaptive video TTS of ours offer complementary strengths, enabling Video-RTS's strong reasoning performance.

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel \textbf{V}ision-\textbf{L}anguage \textbf{P}reference learning framework, named \textbf{VLP}, which learns a vision-language preference model to provide feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The model learns to extract language-related features, and then serves as a preference predictor in various downstream tasks. The policy can be learned according to the annotated preferences via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.

VLP: Vision-Language Preference Learning for Embodied Manipulation

Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.

QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models

Despite the advancements made in Vision Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tunes a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM's knowledge boundary based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://code.github.com

Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference

We introduce a novel dual Graph Neural Network architecture that explicitly separates temporal dynamics from static handshape configurations in sign language processing. While handshapes serve as fundamental phonological units in sign languages, with American Sign Language employing 40--50 distinct handshapes, computational approaches rarely model them explicitly, limiting both recognition accuracy and linguistic analysis. Our approach combines anatomically-informed graph structures with contrastive learning to address key challenges in handshape recognition, including subtle inter-class distinctions and temporal variations. Achieving 46.07% accuracy across 37 handshape classes, a significant improvement over baseline methods (25.40%), we establish the first benchmark for structured handshape recognition in signing sequences. This work advances sign language processing by bridging computational models with linguistic structure, providing a framework for more accurate phonological modeling.

Improving Handshape Representations for Sign Language Processing: A Graph Neural Network Approach

Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods struggle to effectively transfer step-specific reasoning abilities, as they fail to account for the variation in accessible information and learning difficulty across reasoning steps. To address this, we propose Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is highly adaptable across various frameworks of multi-step retrieval-augmented language models, including those based on reasoning paths or question decomposition. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.

StepER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models

It has been demonstrated that incorporating external information as textual modality can effectively improve time series forecasting accuracy. However, current multi-modal models ignore the dynamic and different relations between time series patterns and textual features, which leads to poor performance in temporal-textual feature fusion. In this paper, we propose a lightweight and model-agnostic temporal-textual fusion framework named Cross-MoE. It replaces Cross Attention with Cross-Ranker to reduce computational complexity, and enhances modality-aware correlation memorization with Mixture-of-Experts (MoE) networks to tolerate the distributional shifts in time series. The experimental results demonstrate a 8.5% average reduction in Mean Squared Error (MSE) compared to the SOTA multi-modal time series framework. Notably, our method requires only 75% of computational overhead and 12.5% of memory usage compared with Cross Attention mechanism.

Downloads

Next from EMNLP 2025

HealthCards: Exploring Text-to-Image Generation as Visual Aids for Healthcare Knowledge Democratizing and Education

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES