Rethinking Soft Labels for Knowledge Distillation: A Bias–Variance Tradeoff Perspective

Authors: Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, Qian Zhang

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wisely. Further, under the same distillation temperature setting, we observe that the distillation performance is negatively associated with the number of some specific samples, which are named as regularization samples since these samples lead to bias increasing and variance decreasing. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. Our discoveries inspired us to propose the novel weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method.
Researcher Affiliation | Collaboration | Helong Zhou1, Liangchen Song2, Jiajie Chen1, Ye Zhou1, Guoli Wang1,3, Junsong Yuan2, Qian Zhang1; 1Horizon Robotics, 2University at Buffalo, 3Tsinghua University
Pseudocode | No | The paper includes a computational graph (Figure 3) but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/bellymonster/Weighted-Soft-Label-Distillation.
Open Datasets | Yes | The datasets used in our experiments are CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). CIFAR-100 contains 50K training and 10K test images of size 32×32. ImageNet contains 1.2 million training and 50K validation images.
Dataset Splits | Yes | CIFAR-100 contains 50K training and 10K test images of size 32×32. ImageNet contains 1.2 million training and 50K validation images.
Hardware Specification | No | The paper does not mention any specific hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments.
Software Dependencies | No | The paper implicitly relies on common deep learning software such as PyTorch, but it does not specify any software dependencies or version numbers.
Experiment Setup | Yes | For distillation, we set the temperature τ = 4 for CIFAR and τ = 2 for ImageNet. For the loss function, we set α = 2.25 for distillation on CIFAR and α = 2.5 for ImageNet via grid search. The teacher network is well-trained previously and fixed during training.
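
As a reading aid for the setup quoted in the last row, below is a minimal PyTorch sketch of a standard temperature-scaled distillation loss using the reported CIFAR-100 hyperparameters (τ = 4, α = 2.25). The function name kd_loss and the exact placement of α relative to the cross-entropy term are illustrative assumptions; the paper's actual contribution, weighted soft labels, additionally makes the soft-label weight sample-wise, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=2.25):
    """Generic temperature-scaled distillation loss (Hinton-style sketch).

    tau=4 and alpha=2.25 mirror the CIFAR-100 settings quoted above;
    the paper's weighted-soft-label variant replaces the fixed alpha
    with sample-wise weights, which this sketch does not implement.
    """
    # Hard-label cross-entropy on the student's unscaled logits.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions, scaled by tau^2 so its gradient magnitude
    # stays comparable to the hard-label term.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)

    return ce + alpha * kd
```

Substituting τ = 2 and α = 2.5 would mirror the ImageNet setting quoted in the same row.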