Learning Student-Friendly Teacher Networks for Knowledge Distillation
Authors: Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Dae Sin Kim, Bohyung Han
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluation on multiple well-known datasets including ImageNet [30] and CIFAR-100 [31] using several different backbone networks such as ResNet [32], Wide ResNet [33], VGG [34], ShuffleNet V1 [35], and ShuffleNet V2 [36]. For comprehensive evaluation, we adopt various knowledge distillation techniques, which include KD [1], FitNets [18], AT [20], SP [37], VID [38], RKD [22], PKT [39], AB [40], FT [21], CRD [23], SSKD [26], and OH [19]. Among these methods, the feature distillation methods [18, 20, 37, 38, 22, 39, 40, 21, 19] conduct joint distillation with conventional KD [1] during student training, which yields higher accuracy in practice than feature distillation alone (a generic sketch of such a joint objective appears after this table). We also include comparisons with collaborative learning methods such as DML [4] and KDCL [5], and a curriculum learning technique, RCO [28]. We have reproduced the results of the existing methods using the implementations provided by the authors of the papers. Tables 1 and 2 present the full results on the CIFAR-100 dataset. |
| Researcher Affiliation | Collaboration | Dae Young Park (1), Moon-Hyun Cha (1), Changwook Jeong (1), Dae Sin Kim (1), and Bohyung Han (2). 1: DIT Center, Samsung Electronics, Korea; 2: ECE & ASRI, Seoul National University, Korea |
| Pseudocode | No | The paper describes its proposed method in detail with text and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions "We have reproduced the results from the existing methods using the implementations provided by the authors of the papers." and cites a GitHub repository (Footnote 2: https://github.com/HobbitLong/RepDistiller) which contains code for *existing methods*, not for the novel methodology presented in this paper. |
| Open Datasets | Yes | We perform evaluation on multiple well-known datasets including ImageNet [30] and CIFAR-100 [31]... CIFAR-100 [31] consists of 50K training images and 10K testing images in 100 classes. ImageNet [30] consists of 1.2M training images and 50K validation images for 1K classes. |
| Dataset Splits | Yes | CIFAR-100 [31] consists of 50K training images and 10K testing images in 100 classes. ImageNet [30] consists of 1.2M training images and 50K validation images for 1K classes. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., specific GPU or CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper states: "We adopt the standard PyTorch set-up for ImageNet training for this experiment" and provides a link to a PyTorch example (https://github.com/pytorch/examples/tree/master/imagenet) in Footnote 3. However, it does not specify the version of PyTorch or of any other software libraries or dependencies used. |
| Experiment Setup | Yes | The experiment setup for CIFAR-100 is identical to the one used in CRD (Footnote 2); most experiments employ the SGD optimizer with learning rate 0.05, weight decay 0.0005, and momentum 0.9, while the learning rate is set to 0.01 in the ShuffleNet experiments. The hyperparameters for the loss function are set as λ_T = 1, λ_CE^R = 1, λ_KL^R = 3, and τ = 1 in student-aware training, while τ = 4 in knowledge distillation. ... The optimization is performed by SGD with learning rate 0.1, weight decay 0.0001, and momentum 0.9. The coefficients of the individual loss terms are set as λ_T = 1, λ_CE^R = 1, and λ_KL^R = 1, where τ = 1. (A sketch wiring the CIFAR-100 settings into a training step follows the table.) |
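
As referenced in the Research Type row, the feature-distillation baselines are trained jointly with conventional KD [1]. The following is a minimal, generic sketch of what such a joint objective typically looks like; it is not the authors' implementation, and the weights `alpha` and `beta` as well as the helper names are illustrative placeholders.

```python
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Conventional KD [1]: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 as is standard practice."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def joint_distillation_loss(student_logits, labels, teacher_logits,
                            feature_loss, alpha=1.0, beta=1.0):
    """Joint objective: cross-entropy + conventional KD + a feature term.
    `feature_loss` stands in for whichever feature method is used
    (FitNets, AT, ...); `alpha` and `beta` are placeholder weights,
    not values taken from the paper."""
    ce = F.cross_entropy(student_logits, labels)
    kd = kd_loss(student_logits, teacher_logits)
    return ce + alpha * kd + beta * feature_loss
```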
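
The Experiment Setup row quotes the CIFAR-100 optimizer settings and the loss coefficients for student-aware training. The sketch below wires those quoted numbers into a single PyTorch training step; since the paper does not release code, the tiny stand-in heads, the direction of the KL term, and all function and variable names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD

# Stand-in modules: a placeholder teacher head and one auxiliary "student
# branch" head over 512-d features and 100 classes, purely for illustration.
teacher = nn.Linear(512, 100)
student_branch = nn.Linear(512, 100)

# CIFAR-100 optimizer settings quoted in the paper:
# SGD, lr 0.05 (0.01 for ShuffleNet), weight decay 5e-4, momentum 0.9.
optimizer = SGD(
    list(teacher.parameters()) + list(student_branch.parameters()),
    lr=0.05, momentum=0.9, weight_decay=5e-4,
)

# Loss coefficients quoted for student-aware training:
# lambda_T = 1, lambda_CE^R = 1, lambda_KL^R = 3, tau = 1.
LAMBDA_T, LAMBDA_CE_R, LAMBDA_KL_R, TAU = 1.0, 1.0, 3.0, 1.0


def student_aware_loss(teacher_logits, branch_logits, labels):
    """Weighted sum implied by the quoted coefficients: teacher cross-entropy,
    branch cross-entropy, and a KL term between teacher and branch at
    temperature TAU. The exact term structure is an assumption based on the
    paper's description, not released code."""
    ce_teacher = F.cross_entropy(teacher_logits, labels)
    ce_branch = F.cross_entropy(branch_logits, labels)
    kl = F.kl_div(
        F.log_softmax(branch_logits / TAU, dim=1),
        F.softmax(teacher_logits / TAU, dim=1),
        reduction="batchmean",
    ) * TAU ** 2
    return LAMBDA_T * ce_teacher + LAMBDA_CE_R * ce_branch + LAMBDA_KL_R * kl


# One illustrative optimization step on random data.
features = torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))
loss = student_aware_loss(teacher(features), student_branch(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```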