We propose Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without any human labeling or existing parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners have only raw outputs, such as recaps, highlights, or questions, or only raw inputs, such as dialogues, articles, or paragraphs, but rarely both sides of the parallel data unless they invest in human labeling. This mismatch forces small models either to learn from very few examples or to rely on costly, broad-scope synthetic examples produced by large LLMs. In PbT, a teacher LLM first transforms each unpaired example into a concise intermediate representation (IR), and a student model learns to invert this transformation, reconstructing the original input from the IR. This lets us pair each output with its generated input, yielding high-quality paired data. We evaluate PbT on five benchmarks—dialogue summarization (SAMSum, DialogSum), document summarization (XSum, CNNDM), and question generation (SQuAD)—as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on corpora generated by a 70B teacher and other unsupervised baselines, closing the gap to human-annotated pairs to within 2 ROUGE points. Human evaluation on SwitchBoard further confirms that only PbT meets the target summary lengths with concise, faithful outputs, while all baselines remain overly verbose.
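The two-stage pairing loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `teacher_to_ir` and `student_invert` are hypothetical stand-ins for the teacher LLM and the trained student model, implemented here as trivial string stubs so the data flow is concrete.

```python
# Hypothetical sketch of the PbT pairing loop: an unpaired output is
# compressed to an intermediate representation (IR) by the teacher, and
# the student inverts the IR into a synthetic input, producing a
# (synthetic_input, output) training pair. Both models are stubbed.

def teacher_to_ir(output_text: str) -> str:
    # Stand-in for the teacher LLM: compress the unpaired output
    # into a concise IR (stub: keep only the first sentence).
    return output_text.split(".")[0].strip()

def student_invert(ir: str) -> str:
    # Stand-in for the student model: reconstruct a plausible input
    # from the IR (stub: wrap the IR as a marked synthetic source).
    return f"<synthetic input for: {ir}>"

def pair_by_the_teacher(unpaired_outputs):
    """Turn output-only data into (synthetic_input, output) pairs."""
    pairs = []
    for output in unpaired_outputs:
        ir = teacher_to_ir(output)          # stage 1: teacher -> IR
        synthetic_input = student_invert(ir)  # stage 2: student inverts IR
        pairs.append((synthetic_input, output))
    return pairs

if __name__ == "__main__":
    outputs = ["Alice and Bob agree to meet at noon. They pick a cafe."]
    for src, tgt in pair_by_the_teacher(outputs):
        print(src, "->", tgt)
```

The resulting pairs can then be used to fine-tune a small generation model, which is how the 8B student in the abstract is trained.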