Knowledge Distillation (KD) aims to transfer the dark knowledge that encodes inter-class similarity, semantic structure, and decision boundaries from a powerful teacher model to a compact student model by minimizing the Kullback-Leibler (KL) divergence between their output distributions. While effective, we demonstrate that KL-based KD is designed to match probability values pointwise and does not explicitly constrain the relative relationships between classes. Moreover, we empirically find that vanilla KL-based KD suffers from gradient competition due to the zero-sum constraint in the softmax space, which may implicitly change the inter-class rank relationships learned by the student model, particularly under capacity mismatch between teacher and student. Therefore, we argue that the student model should learn not only the probability values but also the relative ranking of classes. Accordingly, we propose a simple yet effective Relative Confidence Knowledge Distillation (RCKD) method that aligns the teacher's and student's relative confidence matrices via cosine similarity, achieving more efficient and robust distillation from a stronger teacher model. Extensive experiments demonstrate that RCKD consistently outperforms existing logit-based KD methods and exhibits strong adaptability across various teacher architectures and capacities.
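The core idea described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the abstract does not specify how the relative confidence matrix is constructed, so here it is assumed to be the matrix of pairwise probability differences, and the names (`relative_confidence_matrix`, `rckd_loss`) and the temperature value are hypothetical.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_confidence_matrix(probs):
    """Assumed construction: pairwise differences p_i - p_j per sample.

    The sign pattern of this (B, C, C) matrix encodes the inter-class
    rank relations that RCKD aims to preserve; the exact definition in
    the paper may differ.
    """
    return probs[:, :, None] - probs[:, None, :]

def rckd_loss(teacher_logits, student_logits, T=4.0):
    """Align teacher/student relative confidence matrices via cosine similarity.

    Unlike KL divergence, which matches probability values pointwise,
    cosine similarity is scale-invariant, so it emphasizes the relative
    structure (ranking) over exact values.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    r_t = relative_confidence_matrix(p_t).reshape(len(p_t), -1)
    r_s = relative_confidence_matrix(p_s).reshape(len(p_s), -1)
    cos = (r_t * r_s).sum(axis=-1) / (
        np.linalg.norm(r_t, axis=-1) * np.linalg.norm(r_s, axis=-1) + 1e-12
    )
    return float((1.0 - cos).mean())  # 0 when relative structures agree
```

Because cosine similarity ignores the magnitude of the matrices, a student whose probabilities are less peaked than the teacher's (a common symptom of capacity mismatch) can still achieve zero loss as long as it preserves the teacher's inter-class ordering.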