Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study
Authors: Zhiqiang Shen, Zechun Liu, Dejia Xu, Zitian Chen, Kwang-Ting Cheng, Marios Savvides
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After that, we study the one-sidedness and imperfection of the incompatibility view through extensive analyses, visualizations and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation. |
| Researcher Affiliation | Academia | Zhiqiang Shen (CMU); Zechun Liu (CMU & HKUST); Dejia Xu (Peking University); Zitian Chen (UMass Amherst); Kwang-Ting Cheng (HKUST); Marios Savvides (CMU) |
| Pseudocode | Yes | Algorithm 1: PyTorch-like Code for Calculating Stability Metric. (An illustrative sketch follows the table.) |
| Open Source Code | No | The paper gives a project page ("Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html") but does not provide a direct link to a source-code repository or state that the code for this work is publicly released. |
| Open Datasets | Yes | We conduct experiments on three datasets: ImageNet-1K (Deng et al., 2009), CUB200-2011 (Wah et al., 2011a) and iMaterialist product recognition challenge data (in Appendix D). |
| Dataset Splits | Yes | While on the validation set the accuracy is comparable or even slightly better (the boosts on CUB are greater than those on ImageNet-1K, as shown in Table 2). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper provides Python-like pseudocode but does not list specific software dependencies with version numbers used for the experiments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For training teacher networks, we follow the standard training protocol (He et al., 2016; Goyal et al., 2017), i.e., the total number of training epochs is 90, and the initial learning rate of 0.1 is decayed to 1/10 every 30 epochs. For distillation, as the supervision is a soft distribution that changes dynamically, we train for 200 epochs and multiply the learning rate by 0.1 at epochs 80 and 160. (A schedule sketch follows the table.) |
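The paper's Algorithm 1 is provided only as PyTorch-like pseudocode and is not reproduced here. The snippet below is a hypothetical sketch of one way a prediction-stability metric could be computed, assuming stability is measured as the spread of softmax outputs across augmented views of the same image; the function name `prediction_stability` and the choice of aggregation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_stability(model, views):
    # views: (num_views, C, H, W) tensor holding augmented copies of a single image.
    # Returns the mean per-class standard deviation of softmax outputs across the
    # views; lower values indicate more stable predictions. This is an assumed
    # reading of "stability", not the paper's exact Algorithm 1.
    model.eval()
    probs = F.softmax(model(views), dim=1)            # (num_views, num_classes)
    per_class_std = probs.std(dim=0, unbiased=False)  # spread across views, per class
    return per_class_std.mean().item()
```

In practice such a score would be averaged over many validation images, for example to compare a label-smoothed teacher against one trained with hard labels.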
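As a concrete reading of the training protocol quoted in the Experiment Setup row, the sketch below sets up the two step-decay schedules with standard PyTorch SGD and MultiStepLR. The momentum and weight-decay values are common ImageNet defaults and are assumptions, not stated in the excerpt.

```python
import torch

def make_optimizer_and_scheduler(model, distillation=False):
    # Initial learning rate 0.1 with step decay by 0.1, as in the quoted protocol.
    # Momentum 0.9 and weight decay 1e-4 are assumed defaults (not from the excerpt).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    if distillation:
        # Student distillation: 200 epochs, lr multiplied by 0.1 at epochs 80 and 160.
        milestones, total_epochs = [80, 160], 200
    else:
        # Teacher training: 90 epochs, lr decayed to 1/10 every 30 epochs.
        milestones, total_epochs = [30, 60], 90
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=milestones,
                                                     gamma=0.1)
    return optimizer, scheduler, total_epochs
```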