Knowledge distillation (KD) is widely recognized as an effective approach for compressing large language models (LLMs). However, standard KD methods often falter when teacher and student models differ in architecture or tokenization, since such heterogeneity creates a mismatch between their representations. While Optimal Transport (OT) offers a promising way to align these representations, most OT-based methods rely on a single cost function, which is insufficient to capture the multifaceted discrepancies between models with distinct designs. To address this limitation, we introduce Multi-Cost Wasserstein Knowledge Distillation (MCW-KD), a novel framework that enhances KD by simultaneously optimizing several cost functions within a unified OT formulation. MCW-KD employs dedicated cost matrices to align both the final hidden states and the output distributions of the two models. We also provide a rigorous theoretical foundation for the proposed Multi-Cost Wasserstein Distance, ensuring both mathematical validity and computational tractability. Extensive experiments on instruction-following datasets demonstrate that MCW-KD significantly improves student model performance compared to state-of-the-art KD baselines, especially when teacher and student models use different tokenizers.
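The core idea of combining several cost functions in one OT problem can be sketched in a few lines. The snippet below is a minimal, illustrative implementation assuming the multi-cost distance is a weighted sum of cost matrices solved with entropic (Sinkhorn) regularization; the paper's exact objective, cost definitions, and solver may differ. All names (`sinkhorn`, `multi_cost_wasserstein`, the toy hidden states) are hypothetical.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=200):
    """Entropic-regularized OT: transport plan for cost matrix C
    with marginals a (rows) and b (columns)."""
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # alternate scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P

def multi_cost_wasserstein(costs, weights, a, b, eps=0.5):
    """Sketch of a multi-cost OT distance: combine several cost
    matrices into one (an assumption; MCW-KD's formulation may differ)."""
    C = sum(w * Ck for w, Ck in zip(weights, costs))
    P = sinkhorn(C, a, b, eps)
    return float(np.sum(P * C)), P

# Toy setup: teacher with m token positions, student with n (different
# tokenizers), hidden size d. Random states stand in for real models.
rng = np.random.default_rng(0)
m, n, d = 5, 4, 8
H_t = rng.normal(size=(m, d))           # teacher hidden states
H_s = rng.normal(size=(n, d))           # student hidden states

# Cost 1: squared Euclidean distance between hidden states.
C_hid = ((H_t[:, None, :] - H_s[None, :, :]) ** 2).sum(-1)
# Cost 2: 1 - cosine similarity (stand-in for an output-distribution cost).
nt = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
ns = H_s / np.linalg.norm(H_s, axis=1, keepdims=True)
C_cos = 1.0 - nt @ ns.T

a = np.full(m, 1.0 / m)                 # uniform marginals over positions
b = np.full(n, 1.0 / n)
dist, P = multi_cost_wasserstein([C_hid, C_cos], [0.5, 0.5], a, b)
```

In a distillation loop, `dist` would serve as an alignment loss added to the student's training objective; the soft transport plan `P` matches teacher and student positions without requiring equal sequence lengths or shared vocabularies.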
