Toward Student-oriented Teacher Network Training for Knowledge Distillation
Authors: Chengyu Dong, Liyuan Liu, Jingbo Shang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently. From Section 5 (Experiments): 'In this section, we evaluate the effectiveness of our teacher training method in knowledge distillation. We focus on compressing a large network to a smaller one where the student is trained on the same dataset as the teacher. Tables 1 and 2 show the evaluation results on CIFAR-100 and Tiny-ImageNet/ImageNet, respectively.' |
| Researcher Affiliation | Collaboration | Chengyu Dong¹, Liyuan Liu², Jingbo Shang¹ (¹University of California, San Diego; ²Microsoft Research) |
| Pseudocode | No | The paper describes methods through mathematical formulations and prose, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper refers to 'existing repositories for knowledge distillation (Tian et al., 2020; Shah et al., 2020; Matsubara, 2021) and author-provided codes' and footnote 2 mentions 'https://github.com/HobbitLong/RepDistiller'. However, it does not explicitly state that the source code for *their* specific method (So Teacher) is released or provide a direct link to it. |
| Open Datasets | Yes | We conduct experiments on benchmark datasets including CIFAR-100 (Krizhevsky, 2009), Tiny-ImageNet (Tin, 2017), and ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions that 'the optimal temperature is located on an additional holdout set (Guo et al., 2017)' for temperature scaling, implying the use of a validation set. However, it does not provide the split percentages or sample counts for this holdout set, which are needed for reproducibility. |
| Hardware Specification | No | The paper describes the training process and hyperparameters but does not provide specific details about the hardware used, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions 'SGD as the optimizer', 'torchdistill', and 'RepDistiller' (footnote 2), but does not specify version numbers for these software components or other libraries (e.g., Python, PyTorch, CUDA versions) that would be necessary for replication. |
| Experiment Setup | Yes | For all the experiments on CIFAR-100, we employ SGD as the optimizer and train for 240 epochs with a batch size of 64. The learning rate is initialized at 0.05 and decayed by a factor of 10 at epochs 150, 180, and 210, with an exception for ShuffleNet where the learning rate is initialized at 0.01... The weight decay and momentum are fixed as 0.0005 and 0.9 respectively. For Tiny-ImageNet experiments, we employ SGD as the optimizer and conduct the teacher training for 90 epochs with a batch size of 128. The learning rate starts at 0.1 and is decayed by a factor of 10 at epochs 30 and 60. The weight decay and momentum are fixed as 0.0005 and 0.9 respectively. (A configuration sketch follows the table.) |
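
The CIFAR-100 settings quoted in the Experiment Setup row map directly onto a standard PyTorch training loop. The sketch below is illustrative only: it assumes a torchvision ResNet-18 as a stand-in teacher and a plain cross-entropy objective, and it reproduces just the quoted optimizer and learning-rate schedule, not the paper's SoTeacher training objective (whose code, per the Open Source Code row, is not released).

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Assumptions (not from the paper): ResNet-18 is a placeholder teacher and the
# loss is plain cross-entropy; only the quoted CIFAR-100 hyperparameters are used.
device = "cuda" if torch.cuda.is_available() else "cpu"

transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                           shuffle=True, num_workers=2)

model = torchvision.models.resnet18(num_classes=100).to(device)  # placeholder teacher

# Quoted hyperparameters: lr 0.05 (0.01 for ShuffleNet), momentum 0.9,
# weight decay 0.0005, decay by a factor of 10 at epochs 150, 180, and 210.
optimizer = SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(240):  # quoted: 240 epochs
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # epoch-level learning-rate decay
```

For the Tiny-ImageNet runs quoted above, only the dataset, batch size (128), initial learning rate (0.1), milestones ([30, 60]), and epoch count (90) would change.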