Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study
Authors: Zhiqiang Shen, Zechun Liu, Dejia Xu, Zitian Chen, Kwang-Ting Cheng, Marios Savvides
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After that, we study the one-sidedness and imperfection of the incompatibility view through extensive analyses, visualizations and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation. |
| Researcher Affiliation | Academia | Zhiqiang Shen (CMU); Zechun Liu (CMU & HKUST); Dejia Xu (Peking University); Zitian Chen (UMass Amherst); Kwang-Ting Cheng (HKUST); Marios Savvides (CMU) |
| Pseudocode | Yes | Algorithm 1: PyTorch-like Code for Calculating Stability Metric. (An illustrative sketch follows the table.) |
| Open Source Code | No | The paper gives a project page ("Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html") but does not provide a direct link to a source-code repository or state that the code for this work is publicly released. |
| Open Datasets | Yes | We conduct experiments on three datasets: ImageNet-1K (Deng et al., 2009), CUB200-2011 (Wah et al., 2011a) and iMaterialist product recognition challenge data (in Appendix D). |
| Dataset Splits | Yes | While on the validation set the accuracy is comparable or even slightly better (the boosts on CUB are greater than those on ImageNet-1K, as shown in Table 2). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper provides Python-like pseudocode but does not list specific software dependencies with version numbers used for the experiments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For training teacher networks, we follow the standard training protocol (He et al., 2016; Goyal et al., 2017), i.e., the total number of training epochs is 90, and the initial learning rate of 0.1 is decayed to 1/10 every 30 epochs. For distillation, as the supervision is a soft distribution that changes dynamically, we train for 200 epochs and multiply the learning rate by 0.1 at epochs 80 and 160. (A schedule sketch follows the table.) |
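The paper's Algorithm 1 is provided only as PyTorch-like pseudocode and is not reproduced here. The snippet below is a hypothetical sketch of one way a prediction-stability metric could be computed, assuming stability is measured as the spread of softmax outputs across augmented views of the same image; the function name `prediction_stability` and the choice of aggregation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_stability(model, views):
    # views: (num_views, C, H, W) tensor holding augmented copies of a single image.
    # Returns the mean per-class standard deviation of softmax outputs across the
    # views; lower values indicate more stable predictions. This is an assumed
    # reading of "stability", not the paper's exact Algorithm 1.
    model.eval()
    probs = F.softmax(model(views), dim=1)            # (num_views, num_classes)
    per_class_std = probs.std(dim=0, unbiased=False)  # spread across views, per class
    return per_class_std.mean().item()
```

In practice such a score would be averaged over many validation images, for example to compare a label-smoothed teacher against one trained with hard labels.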
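As a concrete reading of the training protocol quoted in the Experiment Setup row, the sketch below sets up the two step-decay schedules with standard PyTorch SGD and MultiStepLR. The momentum and weight-decay values are common ImageNet defaults and are assumptions, not stated in the excerpt.

```python
import torch

def make_optimizer_and_scheduler(model, distillation=False):
    # Initial learning rate 0.1 with step decay by 0.1, as in the quoted protocol.
    # Momentum 0.9 and weight decay 1e-4 are assumed defaults (not from the excerpt).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    if distillation:
        # Student distillation: 200 epochs, lr multiplied by 0.1 at epochs 80 and 160.
        milestones, total_epochs = [80, 160], 200
    else:
        # Teacher training: 90 epochs, lr decayed to 1/10 every 30 epochs.
        milestones, total_epochs = [30, 60], 90
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=milestones,
                                                     gamma=0.1)
    return optimizer, scheduler, total_epochs
```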