Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders such as subword segmentation granularity or token frequency. To address this question, we devise a controlled experiment in which we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension in understanding how overlap affects transfer, namely the impact of semantically similar tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. When testing cross-lingual transfer on downstream tasks, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
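
To make the core experimental manipulation concrete, the sketch below illustrates one simple way to vary vocabulary overlap between two monolingual subword vocabularies: tokens that occur in both languages are either kept as a single shared entry or duplicated with a language tag, so overlap can be dialed from fully disjoint to maximal. This is only an illustrative sketch, not the paper's actual construction; the function name `build_bilingual_vocab` and the `@L1`/`@L2` tagging scheme are hypothetical.

```python
def build_bilingual_vocab(vocab_l1, vocab_l2, shared_fraction):
    """Merge two subword vocabularies while keeping only `shared_fraction`
    of the tokens that appear in both languages as genuinely shared entries.

    vocab_l1, vocab_l2: iterables of subword strings for each language
    shared_fraction:    0.0 -> fully disjoint, 1.0 -> maximal overlap
    """
    common = sorted(set(vocab_l1) & set(vocab_l2))
    n_shared = int(round(shared_fraction * len(common)))
    shared = set(common[:n_shared])   # kept once, usable by both languages
    split = set(common[n_shared:])    # duplicated with a language tag

    merged = set(shared)
    merged |= {t if t not in split else f"{t}@L1" for t in set(vocab_l1) - shared}
    merged |= {t if t not in split else f"{t}@L2" for t in set(vocab_l2) - shared}
    return merged


# Toy usage: two tiny "vocabularies" with some naturally overlapping subwords.
v_en = {"the", "ing", "tion", "haus"}
v_de = {"der", "ung", "tion", "haus"}
print(len(build_bilingual_vocab(v_en, v_de, 0.0)))  # 8: overlapping tokens duplicated per language
print(len(build_bilingual_vocab(v_en, v_de, 1.0)))  # 6: overlapping tokens kept as shared entries
```

In a setup like this, the downstream model's embedding table either assigns one row to a shared token or separate rows to its language-tagged copies, which is what lets the representation analysis described above compare overlapping and disjoint vocabularies on otherwise identical data.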