Test-Time Scaling (TTS) is a promising approach for progressively eliciting a model's intelligence during inference. Recently, training-based test-time scaling methods, such as continued reinforcement learning (RL), have surged in popularity, while training-free methods have faded from prominence. However, the additional train-time computation substantially increases the overall cost of test-time scaling. In this paper, we design a finer-grained sequential scaling method, Conditional Step-level Self-refinement, supported by process verification. Building on its effectiveness, we further combine it with classical parallel scaling methods at the step level, introducing a novel paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs of different scales (3B-14B) and families demonstrate that this hybrid strategy, which incorporates multiple training-free test-time scaling methods at a finer granularity, has considerable potential for expanding the reasoning performance boundaries of LLMs.
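
The abstract describes the paradigm without implementation details, so the following is only a minimal Python sketch of how step-level hybrid scaling could be structured. Here `generate_step_candidates`, `score_step`, and `refine_step` are hypothetical stand-ins for the LLM sampler, the process reward model, and a refinement prompt, and the candidate count and refinement threshold are assumed illustrative values, not the paper's settings.

```python
import random
from typing import List

N_CANDIDATES = 4        # parallel samples per step (assumed value)
REFINE_THRESHOLD = 0.7  # PRM score below which refinement triggers (assumed)
MAX_STEPS = 8           # cap on reasoning steps (assumed)

def generate_step_candidates(prefix: List[str], n: int) -> List[str]:
    # Stand-in for n parallel LLM decodes sharing the same reasoning prefix.
    return [f"step {len(prefix) + 1} (sample {i})" for i in range(n)]

def score_step(prefix: List[str], candidate: str) -> float:
    # Stand-in for a process reward model scoring a single reasoning step.
    return random.random()

def refine_step(prefix: List[str], candidate: str) -> str:
    # Stand-in for prompting the LLM to revise a low-scoring step.
    return candidate + " [refined]"

def hybrid_tts() -> List[str]:
    steps: List[str] = []
    for _ in range(MAX_STEPS):
        # Parallel scaling: sample several candidate next steps at once.
        candidates = generate_step_candidates(steps, N_CANDIDATES)
        # Process verification: keep the candidate the PRM scores highest.
        best_score, best = max((score_step(steps, c), c) for c in candidates)
        # Conditional step-level self-refinement: revise only when the
        # verifier flags a weak step, spending extra compute where needed.
        if best_score < REFINE_THRESHOLD:
            best = refine_step(steps, best)
        steps.append(best)
    return steps

if __name__ == "__main__":
    for s in hybrid_tts():
        print(s)
```

The key design point illustrated is that refinement is conditional: sequential scaling (self-refinement) is invoked only on steps that parallel sampling plus process verification fail to resolve, rather than uniformly at every step.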