Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

Authors: Hui Wang, Hanbin Zhao, Xi Li, Xu Tan

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.
Researcher Affiliation | Collaboration | 1 Zhejiang University, Hangzhou, China; 2 2012 Lab, Huawei Technologies, Hangzhou, China
Pseudocode | Yes | The details of our progressive blockwise learning scheme are shown in Alg. 1. (A generic, illustrative sketch of blockwise distillation appears after the table.)
Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement) for the source code of the described methodology.
Open Datasets | Yes | CIFAR10 [Krizhevsky and Hinton, 2009] is a labeled subset of the 80 million tiny images dataset for object recognition. This dataset contains 60000 32x32 RGB images in 10 classes, with 5000 images per class for training and 1000 images per class for testing. CIFAR100 [Krizhevsky and Hinton, 2009] is also a labeled subset of the 80 million tiny images dataset for object recognition. It contains 100 classes of 600 32x32 images each, with 500 training and 100 testing images per class. ImageNet [Krizhevsky et al., 2012] is the dataset for the ImageNet Large Scale Visual Recognition Challenge 2012. It contains 1.28 million training images and 50k validation images in 1000 classes.
Dataset Splits | Yes | CIFAR10 [...] with 5000 images per class for training and 1000 images per class for testing. CIFAR100 [...] with 500 images for training and 100 images for testing. ImageNet [...] 1.28 million training images and 50k validation images in 1000 classes.
Hardware Specification | Yes | We implement our architecture using Caffe [Jia et al., 2014] and use an NVIDIA TITAN X GPU to train the network.
Software Dependencies | No | The paper mentions using 'Caffe' but does not provide a specific version number for it or any other software dependencies.
Experiment Setup | Yes | On CIFAR10 and CIFAR100, we use SGD with a mini-batch size of 100 at each block learning stage. The initial learning rate is set to 0.01 and is divided by 10 after 3 epochs. We train the network using a weight decay of 0.005 and a momentum of 0.9. From our experiments, we notice that each learning stage converges in less than 6 epochs, so we terminate each learning stage after 6 epochs. On ImageNet, we use SGD with a mini-batch size of 32 at each block learning stage. The momentum parameter is chosen as 0.9, the initial learning rate is set to 0.01, and the weight decay is 0.0005. It takes 50000 training iterations for our method to converge at every learning stage. (The reported hyperparameters are collected in a sketch after the table.)
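
The Pseudocode row only cites Alg. 1 without reproducing it. As a rough illustration of what blockwise knowledge distillation generally looks like (not the paper's Alg. 1, and written in PyTorch rather than the Caffe setup the paper used), the sketch below trains one lightweight student block at a time to mimic the corresponding teacher block's intermediate output. All function and variable names here are assumptions made for illustration.

```python
# Illustrative sketch only -- NOT the paper's Alg. 1. It shows the generic idea of
# progressive blockwise distillation: student blocks replace teacher blocks one at a
# time, each trained to reproduce the teacher's intermediate (block-level) output.
import torch
import torch.nn as nn


def progressive_blockwise_distillation(teacher_blocks, student_blocks, loader,
                                        device="cuda", epochs_per_block=6, lr=0.01):
    """teacher_blocks / student_blocks: lists of nn.Module, one per network block,
    with matching input/output shapes block by block (an assumption of this sketch)."""
    teacher_blocks = [b.to(device).eval() for b in teacher_blocks]
    mse = nn.MSELoss()
    learned = []  # student blocks already trained (kept frozen afterwards)

    for i, student in enumerate(student_blocks):
        student = student.to(device).train()
        opt = torch.optim.SGD(student.parameters(), lr=lr,
                              momentum=0.9, weight_decay=5e-4)
        for _ in range(epochs_per_block):
            for x, _ in loader:  # labels are unused; only the teacher's outputs matter
                x = x.to(device)
                with torch.no_grad():
                    h_teacher = x
                    for blk in teacher_blocks[:i + 1]:
                        h_teacher = blk(h_teacher)        # teacher's block-i output (target)
                    h_student = x
                    for blk in learned:
                        h_student = blk(h_student)        # pass through trained student blocks
                loss = mse(student(h_student), h_teacher)  # local block-level loss
                opt.zero_grad()
                loss.backward()
                opt.step()
        learned.append(student.eval())  # freeze this block and move on to the next one

    return nn.Sequential(*learned)
```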
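
The Experiment Setup row lists every optimizer setting reported in the paper, but no configuration files were released. As a hedged convenience only, the snippet below transcribes those reported values into plain Python; the dictionary names and keys are assumptions made here, not anything published with the paper.

```python
# Reported training hyperparameters, transcribed from the quoted experiment-setup text.
# Dictionary names and keys are illustrative assumptions, not released configuration files.
CIFAR_STAGE = {             # CIFAR10 / CIFAR100, per block learning stage
    "optimizer": "SGD",
    "mini_batch_size": 100,
    "base_lr": 0.01,        # divided by 10 after 3 epochs
    "lr_drop_epoch": 3,
    "lr_drop_factor": 0.1,
    "weight_decay": 0.005,
    "momentum": 0.9,
    "epochs_per_stage": 6,  # each stage reportedly converges in fewer than 6 epochs
}

IMAGENET_STAGE = {          # ImageNet, per block learning stage
    "optimizer": "SGD",
    "mini_batch_size": 32,
    "base_lr": 0.01,
    "weight_decay": 0.0005,
    "momentum": 0.9,
    "iterations_per_stage": 50000,
}
```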