Rethinking Soft Labels for Knowledge Distillation: A Bias–Variance Tradeoff Perspective
Authors: Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, Qian Zhang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wisely. Further, under the same distillation temperature setting, we observe that the distillation performance is negatively associated with the number of some specific samples, which are named as regularization samples since these samples lead to bias increasing and variance decreasing. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. Our discoveries inspired us to propose the novel weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. (A hedged sketch of a sample-wise weighted soft-label loss follows the table.) |
| Researcher Affiliation | Collaboration | Helong Zhou (1), Liangchen Song (2), Jiajie Chen (1), Ye Zhou (1), Guoli Wang (1,3), Junsong Yuan (2), Qian Zhang (1); affiliations: 1 Horizon Robotics, 2 University at Buffalo, 3 Tsinghua University |
| Pseudocode | No | The paper includes a computational graph (Figure 3) but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/bellymonster/Weighted-Soft-Label-Distillation. |
| Open Datasets | Yes | The datasets used in our experiments are CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). CIFAR-100 contains 50K training and 10K test images of size 32×32. ImageNet contains 1.2 million training and 50K validation images. |
| Dataset Splits | Yes | CIFAR-100 contains 50K training and 10K test images of size 32×32. ImageNet contains 1.2 million training and 50K validation images. |
| Hardware Specification | No | The paper does not mention any specific hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not list any software dependencies or version numbers; at most it implicitly relies on a standard deep-learning framework such as PyTorch, as was common for papers of this period. |
| Experiment Setup | Yes | For distillation, we set the temperature τ = 4 for CIFAR and τ = 2 for ImageNet. For the loss function, we set α = 2.25 for distillation on CIFAR and α = 2.5 for ImageNet via grid search. The teacher network is well-trained previously and fixed during training. (A hedged sketch of the corresponding soft-label loss follows the table.) |
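
The Experiment Setup row reports only the temperature and loss-weight values. As a reading aid, here is a minimal PyTorch sketch of the standard soft-label distillation loss those values plug into. The function name, the defaults (the CIFAR-100 values τ = 4, α = 2.25), and the way the hard-label and soft-label terms are combined (`ce + alpha * kd`) are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, tau=4.0, alpha=2.25):
    """Sketch of a standard soft-label distillation loss (hedged, not the paper's code)."""
    # Hard-label cross-entropy on the student's raw logits.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label term: KL divergence between temperature-scaled distributions,
    # rescaled by tau^2 so its gradient magnitude stays comparable to the hard-label term.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
    return ce + alpha * kd
```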
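
The Research Type row quotes the paper's proposal of weighted soft labels that adaptively handle the sample-wise bias-variance tradeoff. The sketch below illustrates that idea only in spirit: the per-sample weight `w` (derived here from the ratio of student to teacher cross-entropy) is a hypothetical stand-in, not the paper's actual weighting scheme, which should be taken from the released code linked above.

```python
import torch
import torch.nn.functional as F

def weighted_soft_label_loss(student_logits, teacher_logits, targets,
                             tau=4.0, alpha=2.25):
    """Illustrative sample-wise weighted soft-label loss (assumed weighting, not the paper's)."""
    # Hard-label cross-entropy, kept per sample so it can also drive the weights.
    ce_student = F.cross_entropy(student_logits, targets, reduction="none")
    with torch.no_grad():
        ce_teacher = F.cross_entropy(teacher_logits, targets, reduction="none")
        # Hypothetical per-sample weight: shrink the soft-label term on samples where the
        # student already fits the hard label well relative to the teacher, i.e. the
        # "regularization samples" quoted in the Research Type row.
        w = 1.0 - torch.exp(-ce_student / (ce_teacher + 1e-8))
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    kd_per_sample = F.kl_div(log_p_student, p_teacher,
                             reduction="none").sum(dim=1) * tau ** 2
    return ce_student.mean() + alpha * (w * kd_per_sample).mean()
```

For CIFAR-100, `student_logits` and `teacher_logits` would be tensors of shape (N, 100) from the student and the fixed, pre-trained teacher, and `targets` the hard labels of shape (N,).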