Efficient Knowledge Distillation from Model Checkpoints
Authors: Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments verify its effectiveness and applicability." (Abstract); "Our contributions are summarized as follows: By designing two exploratory experiments, we observe the phenomenon... Experiments validate its effectiveness and adaptability." (Introduction and Contributions section) |
| Researcher Affiliation | Academia | Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang; Department of Automation, Tsinghua University, China; {wangcf18, yangqs19, hr20}@mails.tsinghua.edu.cn; {shijis, gaohuang}@tsinghua.edu.cn |
| Pseudocode | Yes | "Algorithm 1: Distillation with the optimal intermediate teacher." (a hedged sketch of such a distillation step appears below the table) |
| Open Source Code | Yes | "Our code is available at https://github.com/LeapLabTHU/CheckpointKD." (Abstract); "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provide the URL of code for reproducing the main results." (Checklist 3.a) |
| Open Datasets | Yes | "For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37] and ImageNet [38] datasets with various teacher-student pairs." (Section 3.2); "We only use open source datasets." (Checklist 4.d) |
| Dataset Splits | Yes | "For generality, we conduct experiments on the CIFAR-100 [36], Tiny-ImageNet [37] and ImageNet [38] datasets with various teacher-student pairs." (Section 3.2); "Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 3.2, Section 4.2, Section 5 and the Appendix." (Checklist 3.b). These datasets have well-known, predefined train/validation/test splits, which are implicitly used here (see the loading sketch below the table). |
| Hardware Specification | Yes | "All experiments are implemented by PyTorch and run on TITAN Xp GPUs." (Appendix A.1) |
| Software Dependencies | No | "All experiments are implemented by PyTorch and run on TITAN Xp GPUs." (Appendix A.1). PyTorch is named, but no version number is given, and no other software dependencies are listed with versions. |
| Experiment Setup | Yes | "For fair comparison, we search the optimal hyperparameters (i.e., the loss ratio α and the temperature τ) for each teacher-student pair." (Section 3.2); "We train each teacher model for 200 epochs to ensure convergence. We save the intermediate models at the 20th, 40th, ..., 180th epochs as intermediate teachers, and the models at the 200th epoch as full teachers." (Section 3.2). A hedged sketch of this schedule appears below the table. |
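
The "Pseudocode" and "Experiment Setup" rows reference Algorithm 1 together with the searched loss ratio α and temperature τ. The paper's exact objective lives in Algorithm 1; as a point of reference only, here is a minimal sketch of a standard Hinton-style distillation loss parameterized by these two hyperparameters, under one common weighting convention:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=4.0):
    """Hedged sketch of a standard distillation loss; alpha and tau play
    the roles of the loss ratio and temperature searched per
    teacher-student pair. Not necessarily the paper's exact formulation."""
    # Hard-label cross-entropy against the ground-truth targets.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label KL divergence against the (intermediate) teacher,
    # scaled by tau^2 so gradients stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```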
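
The checkpoint schedule quoted in the "Experiment Setup" row (intermediate teachers at epochs 20, 40, ..., 180; full teacher at epoch 200) can be sketched as follows; `train_one_epoch` is a hypothetical placeholder for the usual supervised loop, not the authors' code:

```python
import torch

def train_teacher(model, train_loader, optimizer, device,
                  epochs=200, save_every=20):
    """Hedged sketch of the teacher-training schedule quoted above."""
    for epoch in range(1, epochs + 1):
        train_one_epoch(model, train_loader, optimizer, device)  # hypothetical helper
        if epoch % save_every == 0:
            # Epochs 20..180 yield intermediate teachers; epoch 200 the full teacher.
            role = "full" if epoch == epochs else "intermediate"
            torch.save(model.state_dict(), f"teacher_{role}_epoch{epoch:03d}.pth")
```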
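
For the "Dataset Splits" row: CIFAR-100 ships with a fixed 50,000/10,000 train/test split that torchvision exposes through the `train` flag, which is what "predefined splits, implicitly used" amounts to in practice. A minimal loading sketch; the normalization statistics are the commonly used CIFAR-100 values, not figures taken from the paper:

```python
from torchvision import datasets, transforms

# Commonly used CIFAR-100 channel statistics (assumption, not from the paper).
normalize = transforms.Normalize((0.5071, 0.4865, 0.4409),
                                 (0.2673, 0.2564, 0.2762))
transform = transforms.Compose([transforms.ToTensor(), normalize])

# train=True / train=False selects the predefined 50k/10k split.
train_set = datasets.CIFAR100("./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR100("./data", train=False, download=True, transform=transform)
```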