China

Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we &quot;warm up&quot; the model by distilling Long CoTs from a toy domain, namely, Knights \&amp; Knaves (K\&amp;K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval\textsuperscript{+}, and MMLU-Pro; (ii) When both the base model and the warmed-up model are RLVR trained on the same small dataset (leq100 examples), the warmed-up model consistently outperforms the base model; (iii) Warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) Introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.

EMNLP 2025

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

sample-efficiency

reasoning

generalization

low-resource

reinforcement learning

Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely, Knights \& Knaves (K\&K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval\textsuperscript{+}, and MMLU-Pro; (ii) When both the base model and the warmed-up model are RLVR trained on the same small dataset (leq100 examples), the warmed-up model consistently outperforms the base model; (iii) Warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) Introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Lifelong model editing aims to dynamically adjust a model's output with respect to specific facts, knowledge points, or behaviors, enabling the model to adapt to the ever-changing demands of the real world without requiring retraining. While some retrieval-based methods have demonstrated potential in lifelong editing scenarios by storing edited knowledge in external memory, they often suffer from limitations in usability, such as requiring additional training corpora or lacking support for reversible and detachable edits. To address these issues, we propose a plug-and-play method for knowledge retrieval and storage, i.e., Layer-Level Prompting (LLP), which enables seamless and efficient lifelong model editing. In our LLP framework, the reasoning process of LLMs is divided into two stages, respectively knowledge retrieval (Think) and knowledge injection(Recall). Specifically, the knowledge retrieval process is performed in the early layers of the model. Based on the retrieved information, the model is guided to access the updated knowledge stored in the subsequent layer to complete the knowledge editing process. Experimental results demonstrate that our method consistently outperforms existing techniques on lifelong model editing tasks, achieving superior performance on question answering and hallucination benchmarks across different LLMs.

Think and Recall: Layer-Level Prompting for Lifelong Model Editing

Speech Language Models (SLMs) enable natural interactions via spoken instructions, which has the potential to more effectively capture user intent by detecting nuances in their speech. This enhanced functionality, however, introduces new security risks as it enables adversaries to bypass safety mechanisms by injecting noise into the input. In this work, we analyze the vulnerability of open-source SLMs to adversarial attacks and evaluate various defense mechanisms. In our experiments, we harnessed a standard PGD in white-box scenario. We find that these models are susceptible to jailbreaks with 100% attack success rates in some instances. We propose post hoc defense techniques that include activation patching to improve robustness up to 99% with a negligible impact on utility. Additionally, we evaluate defenses applied at both the audio encoder and the language model components, weighing their impact on adversarial resistance and usability.

SPIRIT: Patching Speech Language Models against Jailbreak Attacks

Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the training data needed by the Random baseline to reach the target performance.

FIRE: Flexible Integration of Data Quality Ratings for Effective Pretraining

In recent years, developing compact and efficient large language models (LLMs) has emerged as a thriving area of research. However, traditional Supervised Fine-Tuning (SFT), which relies on singular ground truth labels, often fails to capture token-level dependencies and linguistic diversity. To address these limitations, we propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity. This ensures more reliable and effective training. To validate our approach, we constructed a large-scale 1.2M logits dataset and trained a series of science-focused models. Experimental results demonstrate that our method achieves significant improvements over current SOTA, with accuracy gains of 18% on Mawps and 22.7% on TabMWP. Across nine widely used mathematical benchmarks, our method consistently outperforms prior SFT models, achieving an average improvement of 7.28%. All code and datasets will be open-sourced.

Logits-Based Finetuning

Compressing Large Language Models (LLMs) into task-specific Small Language Models (SLMs) encounters two significant challenges: safeguarding domain-specific knowledge privacy and managing limited resources. To tackle these challenges, we propose PPC-GPT, a novel unified framework that systematically addresses both privacy preservation and model compression in federated settings. PPC-GPT works on a server-client federated architecture, where the client sends differentially private (DP) perturbed task-specific data to the server's LLM. The LLM then generates synthetic data along with their corresponding rationales. This synthetic data is subsequently used for both LLM pruning and retraining processes. Our framework's key innovation lies in its holistic integration of privacy-preserving mechanisms, synthetic data generation, and task-specific compression techniques, creating unique benefits through component interaction. Our experiments across diverse text generation tasks demonstrate that PPC-GPT successfully achieves dual objectives: maintaining competitive performance comparable to full-sized LLMs while ensuring robust privacy protection through its federated architecture.

PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation

This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures--Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501)--using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4–8×) and notable decoding speed improvements (up to 2.4×), without compromising generation quality. These results highlight the method's suitability for memory-constrained environments and large-scale deployments.

Efficient Beam Search for Large Language Models Using Trie-Based Decoding

Although large language models (LLM) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.

FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Most multilingual question-answering benchmarks, while covering a diverse pool of languages, do not factor in regional diversity in the information they capture and tend to be Western-centric. This introduces a significant gap in fairly evaluating multilingual models' comprehension of factual information from diverse geographical locations. To address this, we introduce XNationQA to investigate the cultural literacy of multilingual LLMs. XNationQA encompasses questions on the geography, culture, and history of nine countries, containing a total of 49,280 questions in seven languages. We benchmark eight standard multilingual LLMs on XNationQA. Our analyses uncover a considerable discrepancy in the model's accessibility to culturally specific facts across languages. Notably, we often find that a model demonstrates greater knowledge of cultural information in English than in the dominant language of the respective culture. The models show better performance in Western languages, though this does not translate to their being more literate for Western countries, which is counter-intuitive. Furthermore, we observe that models have a very limited ability to transfer knowledge across languages, particularly evident in open-source models.

 Do You Know About My Nation? Investigating Multilingual Language Models’ Cultural Literacy Through Factual Knowledge

Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlook the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. We decompose task-specific representations into two complementary neuronal subspaces that regulate neuron sensitivity and input adaptability. Leveraging this decomposition, we introduce NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrate that NeuroMerging achieves superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlight the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Code will be released upon acceptance.

To See a World in a Spark of Neuron: Disentangling Multi-Task Interference for Training-Free Model Merging

Upgrading embedding models in production vector databases typically necessitates re-encoding the entire corpus and rebuilding the Approximate Nearest Neighbor (ANN) index, leading to significant operational disruption and computational cost. This paper presents Drift-Adapter, a lightweight, learnable transformation layer designed to bridge embedding spaces between model versions. By mapping new queries into the legacy embedding space, Drift-Adapter enables the continued use of the existing ANN index, effectively deferring full re-computation. We systematically evaluate three adapter parameterizations: Orthogonal Procrustes, Low-Rank Affine, and a compact Residual MLP, trained on a small sample of paired old/new embeddings. Experiments on MTEB text corpora and a CLIP image model upgrade (1M items) show that Drift-Adapter recovers 95–99% of the retrieval recall (Recall@10, MRR) of a full re-embedding, adding less than 10,μs query latency. Compared to operational strategies like full re-indexing or dual-index serving, Drift-Adapter dramatically reduces recompute costs (by over 100 times) and facilitates upgrades with near-zero operational interruption. We analyze robustness to varied model drift, training data size, scalability to billion-item systems, and the impact of design choices like diagonal scaling, demonstrating Drift-Adapter's viability as a pragmatic solution for agile model deployment.

Downloads

Next from EMNLP 2025

Think and Recall: Layer-Level Prompting for Lifelong Model Editing

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES