Improved Knowledge Distillation via Teacher Assistant
Authors: Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, Hassan Ghasemzadeh
AAAI 2020, pp. 5191-5198
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical analysis and extensive experiments on CIFAR-10/100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach. We describe in this section the settings of our experiments. Datasets. We perform a set of experiments on two standard datasets CIFAR-10 and CIFAR-100 and one experiment on the large-scale ImageNet dataset. |
| Researcher Affiliation | Collaboration | Seyed Iman Mirzadeh et al. Affiliations: 1 Washington State University, WA, USA; 2 DeepMind, CA, USA; 3 D.E. Shaw, NY, USA. Emails: 1 {seyediman.mirzadeh, hassan.ghasemzadeh}@wsu.edu; 2 {farajtabar, anglili, nirlevine}@google.com; 3 akihiro.matsukawa@gmail.com |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Codes and Appendix are available at the following address: https://github.com/imirzadeh/Teacher-Assistant-Knowledge-Distillation |
| Open Datasets | Yes | Datasets. We perform a set of experiments on two standard datasets CIFAR-10 and CIFAR-100 and one experiment on the large-scale ImageNet dataset. The datasets consist of 32×32 RGB images. The task for all of them is to classify images into image categories. CIFAR-10, CIFAR-100 and ImageNet contain 10, 100, and 1000 classes, respectively. |
| Dataset Splits | No | The paper mentions using a hyperparameter optimization toolkit but does not explicitly provide details about the validation dataset splits (percentages, counts, or methodology). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing instance specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'Microsoft-Research 2018' hyperparameter optimization toolkit but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For optimization, we used stochastic gradient descent with Nesterov momentum of 0.9 and learning rate of 0.1 for 150 epochs. For experiments on plain CNN networks, we used the same learning rate, while for ResNet training we decrease the learning rate to 0.01 on epoch 80 and 0.001 on epoch 120. We also used weight decay with the value of 0.0001 for training ResNets. A minimal sketch of this optimizer configuration follows the table. |
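
The quoted experiment setup maps onto a standard PyTorch optimizer and learning-rate schedule. The snippet below is a minimal sketch of that configuration only, assuming a generic model, data loader, and cross-entropy loss as placeholders; it is not the authors' released code (their repository linked above is the authoritative implementation).

```python
# Sketch of the reported ResNet training settings: SGD with Nesterov momentum
# 0.9, initial learning rate 0.1 for 150 epochs, decayed to 0.01 at epoch 80
# and 0.001 at epoch 120, with weight decay 1e-4. Plain-CNN runs reportedly
# keep the learning rate fixed at 0.1 (i.e., no scheduler).
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def make_resnet_optimizer(model: nn.Module):
    optimizer = SGD(
        model.parameters(),
        lr=0.1,            # initial learning rate
        momentum=0.9,      # Nesterov momentum of 0.9
        nesterov=True,
        weight_decay=1e-4, # weight decay used for ResNet training
    )
    # Drop the learning rate by 10x at epochs 80 and 120 (0.1 -> 0.01 -> 0.001).
    scheduler = MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
    return optimizer, scheduler

# Usage (assumes `model`, `train_loader`, and `criterion` are defined elsewhere):
# optimizer, scheduler = make_resnet_optimizer(model)
# for epoch in range(150):
#     for inputs, targets in train_loader:
#         optimizer.zero_grad()
#         loss = criterion(model(inputs), targets)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```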