Standard language models assign each token a single monolithic embedding, which may limit their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be represented more effectively through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA). Our findings demonstrate that representing tokens compositionally via ASG yields significant compression, reducing embedding parameters to 0.4-0.5% of the original, while maintaining >95% task performance relative to the base model, even on generative tasks. Furthermore, ASG outperforms prior semantic grouping methods, particularly in preserving the nuanced information crucial for zero-shot cross-lingual transfer. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling more compact models.
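The core idea of a product-quantized embedding table can be illustrated with a minimal sketch. This is not the authors' implementation: the group count, codebook size, dimensions, and all function names below are illustrative assumptions. Each token stores only a handful of small code indices, and its embedding is composed by concatenating the corresponding sub-vectors from shared codebooks.

```python
import numpy as np

# Hypothetical hyperparameters (not from the paper):
# G subspaces, K codes per subspace, embedding dim d, vocabulary size vocab.
G, K, d, vocab = 4, 256, 64, 30_000
rng = np.random.default_rng(0)

# Shared codebooks: G groups, each holding K sub-vectors of size d // G.
codebooks = rng.standard_normal((G, K, d // G))

# Each token keeps only G small integer indices instead of a dense d-dim row.
token_codes = rng.integers(0, K, size=(vocab, G))

def embed(token_id: int) -> np.ndarray:
    """Compose a token embedding by concatenating one sub-vector per group."""
    parts = [codebooks[g, token_codes[token_id, g]] for g in range(G)]
    return np.concatenate(parts)

vec = embed(42)
print(vec.shape)  # (64,)

# Float-parameter comparison: shared codebooks vs. a full embedding matrix
# (the per-token integer codes add a small extra cost not counted here).
pq_params = codebooks.size    # G * K * (d // G) = 16,384 floats
full_params = vocab * d       # 1,920,000 floats
print(pq_params / full_params)  # ~0.0085, i.e. under 1% of the dense table
```

Even at these toy sizes the codebooks need under 1% of the floats of a dense embedding matrix, which conveys how a PQ-style compositional scheme can reach the extreme compression ratios the abstract reports.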