Efficient Knowledge Distillation from Model Checkpoints

Authors: Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang

NeurIPS 2022

Each entry below gives the reproducibility variable, its result, and the LLM response quoting the paper's supporting evidence.
Research Type: Experimental
LLM Response: "Experiments verify its effectiveness and applicability." (Abstract); "Our contributions are summarized as follows: By designing two exploratory experiments, we observe the phenomenon... Experiments validate its effectiveness and adaptability." (Introduction and Contributions section).

Researcher Affiliation: Academia
LLM Response: Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang; Department of Automation, Tsinghua University, China. Emails: {wangcf18, yangqs19, hr20}@mails.tsinghua.edu.cn, {shijis, gaohuang}@tsinghua.edu.cn.

Pseudocode: Yes
LLM Response: "Algorithm 1: Distillation with the optimal intermediate teacher."

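The paper's Algorithm 1 is only named here, so for reference the following is a minimal sketch of a generic distillation step against a frozen teacher loaded from an intermediate checkpoint, in the standard Hinton-style KD form with loss ratio α and temperature τ. The function `kd_step` and all names are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (not the paper's exact Algorithm 1): one KD training step
# against a frozen teacher restored from an intermediate checkpoint.
# Standard KD loss: L = alpha * CE(student, y) + (1 - alpha) * tau^2 * KL.
# Which term alpha weights is a convention choice; the paper only names
# "the loss ratio alpha and the temperature tau".
import torch
import torch.nn.functional as F

def kd_step(student, teacher, x, y, optimizer, alpha=0.9, tau=4.0):
    teacher.eval()                       # intermediate teacher stays frozen
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    ce = F.cross_entropy(s_logits, y)
    kl = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)                       # tau^2 keeps gradient scale comparable
    loss = alpha * ce + (1 - alpha) * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
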
Open Source Code: Yes
LLM Response: "Our code is available at https://github.com/LeapLabTHU/CheckpointKD." (Abstract); "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provide the URL of code for reproducing the main results." (Checklist 3.a).

Open Datasets: Yes
LLM Response: "For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37] and ImageNet [38] datasets with various teacher-student pairs." (Section 3.2); "We only use open source datasets." (Checklist 4.d).

Dataset Splits: Yes
LLM Response: "For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37] and ImageNet [38] datasets with various teacher-student pairs." (Section 3.2); "Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 3.2, Section 4.2, Section 5 and the Appendix." (Checklist 3.b). These datasets have well-known, predefined train/validation/test splits, which the paper implicitly uses.

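To illustrate what "predefined splits" means in practice, a torchvision loader for CIFAR-100's built-in train/test split might look like the sketch below. This is an assumption-laden illustration; the paper's released code may load data differently.

```python
# Sketch: CIFAR-100's predefined train/test split via torchvision
# (illustrative; not taken from the paper's repository).
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    # Commonly used CIFAR-100 channel statistics (assumption, not from the paper):
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)
```
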
Hardware Specification: Yes
LLM Response: "All experiments are implemented by PyTorch and run on TITAN Xp GPUs." (Appendix A.1).

Software Dependencies: No
LLM Response: "All experiments are implemented by PyTorch and run on TITAN Xp GPUs." (Appendix A.1). PyTorch is named, but no version number is given, and no other software dependencies are listed with versions.

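A few lines like the following would close this gap; this is an illustrative sketch of how the relevant versions can be recorded, not something the paper includes.

```python
# Sketch: record the software environment alongside experiment logs
# (illustrative; the paper does not report these versions).
import sys
import torch

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
```
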
Experiment Setup: Yes
LLM Response: "For fair comparison, we search the optimal hyperparameters (i.e., the loss ratio α and the temperature τ) for each teacher-student pair." (Section 3.2); "We train each teacher model for 200 epochs to ensure convergence. We save the intermediate models at the 20th, 40th, ..., 180th epochs as intermediate teachers, and the models at the 200th epoch as full teachers." (Section 3.2).

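The quoted schedule (save at epochs 20, 40, ..., 180 as intermediate teachers; keep the 200-epoch model as the full teacher) maps onto a simple loop. The sketch below uses a hypothetical train_one_epoch routine and placeholder model/optimizer objects; it mirrors the quoted setup but is not the paper's code.

```python
# Sketch: save intermediate teachers every 20 epochs of a 200-epoch run
# (train_one_epoch, model, and optimizer are hypothetical placeholders).
import torch

TOTAL_EPOCHS = 200
SAVE_EVERY = 20

for epoch in range(1, TOTAL_EPOCHS + 1):
    train_one_epoch(model, optimizer)   # hypothetical training routine
    if epoch % SAVE_EVERY == 0:
        tag = "full" if epoch == TOTAL_EPOCHS else "intermediate"
        torch.save(model.state_dict(), f"teacher_{tag}_epoch{epoch:03d}.pth")
```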