China

Scaling test-time computation—generating and analyzing multiple or sequential outputs for a single input—has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi-step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. Semantic clustering enables estimation of the distribution over the semantics of outputs and helps avoid redundant exploration of reasoning paths. However, existing approaches typically rely on external models, which introduce substantial computational overhead and often fail to capture context-aware semantics. We propose Latent Semantic Clustering (LSC), a lightweight and context-sensitive method that leverages the generator LLM’s internal hidden states for clustering, eliminating the need for external models. Our extensive experiment across various LLMs and datasets shows that LSC significantly improves the computational efficiency of test-time scaling while maintaining or exceeding the performance of existing methods.

EMNLP 2025

Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

test-time compute scaling

llm uncertainty quantification

semantic clustering

llm reasoning

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Web banner advertisements, which are placed on websites to guide users to a targeted landing page (LP), are still often selected manually because human preferences are important in selecting which ads to deliver. To automate this process, we propose a new benchmark, BannerBench, to evaluate the human preference-driven banner selection process using vision-language models (VLMs). This benchmark assesses the degree of alignment with human preferences in two tasks: a ranking task and a best-choice task, both using sets of five images derived from a single LP. Our experiments show that VLMs are moderately correlated with human preferences on the ranking task. In the best-choice task, most VLMs perform close to chance level across various prompting strategies. These findings suggest that although VLMs have a basic understanding of human preferences, most of them struggle to pinpoint a single suitable option from many candidates.

BannerBench: Benchmarking Vision Language Models for Multi-Ad Selection with Human Preferences

The exponential growth of scientific publications has overwhelmed reviewers and researchers, with top conferences receiving thousands of submissions annually. Reviewers must assess feasibility, novelty, and impact under tight deadlines, often lacking tools to identify relevant prior work. Early-career researchers face similar challenges, with limited support to navigate fast-evolving fields. Existing LLM-based systems struggle with static retrieval, surface-level features, and lack multi-hop reasoning, leading to shallow or hallucinated assessments. Scientific evaluation requires a deep, relational understanding, which current retrieval-augmented generation (RAG) methods fail to achieve. We introduce SciCompanion, a graph-grounded reasoning framework for structured scientific evaluation. Given a paper or abstract-like input, SciCompanion builds a dynamic knowledge graph from recent publications, domain-specific databases, and curated metadata. It employs multi-hop reasoning to iteratively construct contextual graphs and generate structured critiques, enabling deeper exploration of scientific literature. Unlike sentiment-biased LLM evaluations, SciCompanion directly optimizes retrieval and graph refinement using Group Relative Policy Optimization (GRPO), producing reviews aligned with expert judgments. Experiments on ICLR and ACL datasets show that SciCompanion reduces evaluation error by over 30% compared to prompting-only baselines and allows smaller models to outperform larger ones. Evaluations across three datasets, using metrics for retrieval accuracy, semantic overlap, and multi-hop sensitivity, along with a case study, demonstrate SciCompanion's robustness and versatility.

SciCompanion: Graph-Grounded Reasoning for Structured Evaluation of Scientific Arguments

Infodemics and health misinformation have significant negative effects on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in Generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat Infodemics, most of the existing work focus on developing misinformation datasets from social media and fact-checking platforms, but face limitations in topical coverage, inclusion of AI-generation, and accessibility of raw content. To address these issues, we present MM-Health, a large scale multimodal misinformation dataset in the health domain consisting of 34,746 news article encompassing both textual and visual information. MM-Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, We benchmarked our dataset against three tasks—reliability checks, originality checks, and fine-grained AI detection—demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human and machine-generated content at multimodal levels. Our code and data is available at: \url{https://anonymous.4open.science/r/MM-Health-Supplementary-Material-E14C}}.

From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA). Our findings demonstrate that representing tokens compositionally via ASG gives significant savings in embedding parameters (0.4-0.5%), while maintaining > 95% task performance relative to the base model, even in generative tasks. Furthermore, ASG outperforms prior semantic grouping methods, particularly in preserving nuanced information crucial for zero-shot cross-lingual transfer. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling more compact models.

Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics

Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting—without validating SQLs via execution—tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.

ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

Frontier language models are increasingly based on the Mixture of Experts (MoE) architecture, boosting the efficiency of training and inference by sparsely activating parameters. Nevertheless, training from scratch on trillions of tokens remains so expensive that most users can only finetune these models. In this work, we combine parameter reuse of dense models for the MoE layers ("*upcycling*") with a novel, *adaptive* Nexus router that can integrate new experts into an existing trained model without hurting the performance on previous domains. Our router leverages the knowledge of each expert's training data distribution via domain embeddings to initialize the router, improving specialization and allowing it to adapt faster to new domains than a standard MoE router. Nexus overturns the strict sequential separation between training and finetuning in classical approaches, allowing more powerful improvements to existing models at a later stage through long token-horizon trainings on new pretraining data. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and an 18.8% relative gain for extending the MoE to a new domain with a new expert by using limited finetuning data. This flexibility of Nexus can power an open-source ecosystem where every user continuously assembles their own MoE-mix from a multitude of dense models.

Nexus: Adaptive Upcycling to Efficiently Pretrain Mixture of Experts

Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks through chain-of-thought (CoT) reasoning. However, they suffer from high inference latency due to lengthy reasoning chains. In this paper, we propose SpecCoT, a collaborative framework that combines large and small models for effective yet efficient reasoning. Unlike traditional speculative decoding, which operates at the token level, SpecCoT adopts a step-level verification strategy: the large model first establishes the reasoning direction, and for each intermediate step, the small model generates multiple candidate drafts in parallel. The large model then verifies these drafts, either selecting the most suitable one or rejecting them all and generating its own. SpecCoT approach balances reasoning quality with inference efficiency through fine-grained model cooperation. Experiments across diverse tasks show SpecCoT reduces inference latency by 1.7-4times while maintaining comparable accuracy to standard large model inference.

SpecCoT: Accelerating Chain-of-Thought Reasoning through Speculative Exploration

Molecular optimization—modifying a given molecule to improve desired properties—is a fundamental task in drug discovery. While LLMs hold the potential to solve this task using natural language to drive the optimization, straightforward prompting achieves limited accuracy. In this work, we propose AgentDrug, an agentic workflow that leverages LLMs in a structured refinement process to achieve significantly higher accuracy. AgentDrug defines a nested refinement loop: the inner loop uses feedback from cheminformatics toolkits to validate molecular structures, while the outer loop guides the LLM with generic feedback and a gradient-based objective to steer the molecule toward property improvement. We evaluate AgentDrug on benchmarks with both single- and multi-property optimization under loose and strict thresholds. Results demonstrate significant performance gains over previous methods. With Qwen-2.5-3B, AgentDrug improves accuracy by 20.7% (loose) and 16.8% (strict) on six single-property tasks, and by 7.0% and 5.3% on eight multi-property tasks. With larger model Qwen-2.5-7B, AgentDrug further improves accuracy on 6 single-property objectives by 28.9% (loose) and 29.0% (strict), and on 8 multi-property objectives by 14.9% (loose) and 13.2% (strict).

AgentDrug: Utilizing Large Language Models in an Agentic Workflow for Zero-Shot Molecular Optimization

While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable document structure that could enhance knowledge acquisition and utilization. Motivated by this gap, we propose \textit{\textbf{R}etrieve-\textbf{D}ocument\textbf{R}oute-\textbf{R}ead} (\textbf{RDR\textsuperscript{2}}), a novel framework that explicitly incorporates document structure throughout the RAG process. RDR\textsuperscript{2} employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic behavior curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on three challenging datasets, RDR\textsuperscript{2} achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems' ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.

Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing the tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple tasks execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

Downloads

Next from EMNLP 2025

BannerBench: Benchmarking Vision Language Models for Multi-Ad Selection with Human Preferences

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES