Learning Student-Friendly Teacher Networks for Knowledge Distillation
Authors: Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Dae Sin Kim, Bohyung Han
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluation on multiple well-known datasets including ImageNet [30] and CIFAR-100 [31] using several different backbone networks such as ResNet [32], Wide ResNet [33], VGG [34], ShuffleNet V1 [35], and ShuffleNet V2 [36]. For comprehensive evaluation, we adopt various knowledge distillation techniques, which include KD [1], FitNets [18], AT [20], SP [37], VID [38], RKD [22], PKT [39], AB [40], FT [21], CRD [23], SSKD [26], and OH [19]. Among these methods, the feature distillation methods [18, 20, 37, 38, 22, 39, 40, 21, 19] conduct joint distillation with conventional KD [1] during student training, which yields higher accuracy in practice than feature distillation alone (a generic sketch of such a joint objective appears after this table). We also include comparisons with collaborative learning methods such as DML [4] and KDCL [5], and a curriculum learning technique, RCO [28]. We have reproduced the results of the existing methods using the implementations provided by the authors of the papers. Tables 1 and 2 present the full results on the CIFAR-100 dataset. |
| Researcher Affiliation | Collaboration | Dae Young Park (1), Moon-Hyun Cha (1), Changwook Jeong (1), Dae Sin Kim (1), and Bohyung Han (2). 1: DIT Center, Samsung Electronics, Korea; 2: ECE & ASRI, Seoul National University, Korea |
| Pseudocode | No | The paper describes its proposed method in detail with text and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions "We have reproduced the results from the existing methods using the implementations provided by the authors of the papers." and cites a GitHub repository (Footnote 2: https://github.com/HobbitLong/RepDistiller) which contains code for *existing methods*, not for the novel methodology presented in this paper. |
| Open Datasets | Yes | We perform evaluation on multiple well-known datasets including ImageNet [30] and CIFAR-100 [31]... CIFAR-100 [31] consists of 50K training images and 10K testing images in 100 classes. ImageNet [30] consists of 1.2M training images and 50K validation images for 1K classes. |
| Dataset Splits | Yes | CIFAR-100 [31] consists of 50K training images and 10K testing images in 100 classes. ImageNet [30] consists of 1.2M training images and 50K validation images for 1K classes. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., specific GPU or CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper states: "We adopt the standard PyTorch set-up for ImageNet training for this experiment" and provides a link to a PyTorch example (https://github.com/pytorch/examples/tree/master/imagenet) in Footnote 3. However, it does not specify the version of PyTorch or of any other software libraries or dependencies used. |
| Experiment Setup | Yes | The experiment setup for CIFAR-100 is identical to the one used in CRD (Footnote 2); most experiments employ the SGD optimizer with learning rate 0.05, weight decay 0.0005, and momentum 0.9, while the learning rate is set to 0.01 in the ShuffleNet experiments. The hyperparameters for the loss function are set as λ_T = 1, λ_CE^R = 1, λ_KL^R = 3, and τ = 1 in student-aware training, while τ = 4 in knowledge distillation. ... The optimization is performed by SGD with learning rate 0.1, weight decay 0.0001, and momentum 0.9. The coefficients of the individual loss terms are set as λ_T = 1, λ_CE^R = 1, and λ_KL^R = 1, where τ = 1. (A sketch wiring the CIFAR-100 settings into a training step follows the table.) |
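
As referenced in the Research Type row, the feature-distillation baselines are trained jointly with conventional KD [1]. The following is a minimal, generic sketch of what such a joint objective typically looks like; it is not the authors' implementation, and the weights `alpha` and `beta` as well as the helper names are illustrative placeholders.

```python
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Conventional KD [1]: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 as is standard practice."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def joint_distillation_loss(student_logits, labels, teacher_logits,
                            feature_loss, alpha=1.0, beta=1.0):
    """Joint objective: cross-entropy + conventional KD + a feature term.
    `feature_loss` stands in for whichever feature method is used
    (FitNets, AT, ...); `alpha` and `beta` are placeholder weights,
    not values taken from the paper."""
    ce = F.cross_entropy(student_logits, labels)
    kd = kd_loss(student_logits, teacher_logits)
    return ce + alpha * kd + beta * feature_loss
```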
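
The Experiment Setup row quotes the CIFAR-100 optimizer settings and the loss coefficients for student-aware training. The sketch below wires those quoted numbers into a single PyTorch training step; since the paper does not release code, the tiny stand-in heads, the direction of the KL term, and all function and variable names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD

# Stand-in modules: a placeholder teacher head and one auxiliary "student
# branch" head over 512-d features and 100 classes, purely for illustration.
teacher = nn.Linear(512, 100)
student_branch = nn.Linear(512, 100)

# CIFAR-100 optimizer settings quoted in the paper:
# SGD, lr 0.05 (0.01 for ShuffleNet), weight decay 5e-4, momentum 0.9.
optimizer = SGD(
    list(teacher.parameters()) + list(student_branch.parameters()),
    lr=0.05, momentum=0.9, weight_decay=5e-4,
)

# Loss coefficients quoted for student-aware training:
# lambda_T = 1, lambda_CE^R = 1, lambda_KL^R = 3, tau = 1.
LAMBDA_T, LAMBDA_CE_R, LAMBDA_KL_R, TAU = 1.0, 1.0, 3.0, 1.0


def student_aware_loss(teacher_logits, branch_logits, labels):
    """Weighted sum implied by the quoted coefficients: teacher cross-entropy,
    branch cross-entropy, and a KL term between teacher and branch at
    temperature TAU. The exact term structure is an assumption based on the
    paper's description, not released code."""
    ce_teacher = F.cross_entropy(teacher_logits, labels)
    ce_branch = F.cross_entropy(branch_logits, labels)
    kl = F.kl_div(
        F.log_softmax(branch_logits / TAU, dim=1),
        F.softmax(teacher_logits / TAU, dim=1),
        reduction="batchmean",
    ) * TAU ** 2
    return LAMBDA_T * ce_teacher + LAMBDA_CE_R * ce_branch + LAMBDA_KL_R * kl


# One illustrative optimization step on random data.
features = torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))
loss = student_aware_loss(teacher(features), student_branch(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```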