Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study

Authors: Zhiqiang Shen, Zechun Liu, Dejia Xu, Zitian Chen, Kwang-Ting Cheng, Marios Savvides

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "After that, we study its one-sidedness and imperfection of the incompatibility view through massive analyses, visualizations and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation."
Researcher Affiliation | Academia | Zhiqiang Shen (CMU), Zechun Liu (CMU & HKUST), Dejia Xu (Peking University), Zitian Chen (UMass Amherst), Kwang-Ting Cheng (HKUST), Marios Savvides (CMU)
Pseudocode | Yes | Algorithm 1: PyTorch-like Code for Calculating Stability Metric (an illustrative sketch follows this table).
Open Source Code | No | The paper states "Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html" but does not explicitly link a source-code repository or state that the code for this work is publicly released.
Open Datasets | Yes | "We conduct experiments on three datasets: ImageNet-1K (Deng et al., 2009), CUB200-2011 (Wah et al., 2011a) and iMaterialist product recognition challenge data (in Appendix D)."
Dataset Splits | Yes | "While on validation set the accuracy is comparable or even slightly better (The boosts on CUB is greater than those on ImageNet-1K, as shown in Table 2)."
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper provides PyTorch-like pseudocode but does not list specific software dependencies with version numbers used for the experiments (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "For training teacher networks, we follow the standard training protocol (He et al., 2016; Goyal et al., 2017), i.e., total training epoch is 90, initial learning rate is 0.1 and decayed to 1/10 with every 30 epochs. For distillation, as the supervision is a soft distribution and will dynamically change, we train with 200 epochs and the learning rate is multiplied by 0.1 at 80 and 160 epochs." (A learning-rate schedule sketch follows this table.)
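The table notes that the paper's Algorithm 1 gives PyTorch-like code for its stability metric. The snippet below is not a reproduction of that algorithm; it is a minimal, hypothetical sketch of one way to quantify feature-space stability, assuming a metric based on the mean intra-class spread of penultimate-layer features. The function name intra_class_feature_std and its inputs are illustrative assumptions, not names taken from the paper.

import torch

def intra_class_feature_std(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Illustrative stand-in, NOT the paper's Algorithm 1: measures how tightly
    # samples of each class cluster around their class mean in feature space.
    per_class_std = []
    for c in labels.unique():
        feats_c = features[labels == c]             # (n_c, d) features of class c
        if feats_c.size(0) < 2:
            continue                                # need at least two samples for a spread
        center = feats_c.mean(dim=0, keepdim=True)  # class centroid
        dists = (feats_c - center).norm(dim=1)      # distance of each sample to its centroid
        per_class_std.append(dists.std())
    return torch.stack(per_class_std).mean()        # average spread over classes

# Example with random stand-in features (2048-d, as in a ResNet-50 penultimate layer).
features = torch.randn(512, 2048)
labels = torch.randint(0, 100, (512,))
print(intra_class_feature_std(features, labels))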
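The quoted experiment setup fully specifies the learning-rate schedules (teacher: 90 epochs, initial LR 0.1, divided by 10 every 30 epochs; distillation: 200 epochs, LR multiplied by 0.1 at epochs 80 and 160). The sketch below expresses those schedules with PyTorch's MultiStepLR; the momentum and weight-decay values are assumptions borrowed from the cited Goyal et al. (2017) recipe rather than stated in this section, and the model is a placeholder.

import torch
from torch import nn, optim

def make_optimizer_and_schedule(model, milestones):
    # SGD hyperparameters (momentum 0.9, weight decay 1e-4) are assumed from the
    # standard Goyal et al. (2017) ImageNet recipe; the section only gives LR/epochs.
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler

model = nn.Linear(2048, 1000)  # placeholder; the paper trains standard ImageNet architectures

# Teacher: 90 epochs, LR 0.1 decayed by 10x every 30 epochs -> drops at epochs 30 and 60.
teacher_opt, teacher_sched = make_optimizer_and_schedule(model, milestones=[30, 60])

# Distillation: 200 epochs, LR multiplied by 0.1 at epochs 80 and 160.
student_opt, student_sched = make_optimizer_and_schedule(model, milestones=[80, 160])

for epoch in range(90):          # teacher loop (training step omitted)
    # ... forward/backward/step on teacher_opt ...
    teacher_sched.step()         # advance the stepwise LR decay once per epoch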