Harmonized Dense Knowledge Distillation Training for Multi-Exit Architectures

Authors: Xinglu Wang, Yingming Li (pp. 10218-10226)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR100 and ImageNet show that the HDKD strategy harmoniously improves the performance of the state-of-the-art multi-exit neural networks.
Researcher Affiliation | Academia | Xinglu Wang, Yingming Li, College of Information Science & Electronic Engineering, Zhejiang University, China; {xingluwang, yingming}@zju.edu.cn
Pseudocode | Yes | Algorithm 1: Harmonized Dense Knowledge Distillation Training Procedure for Multi-exit Learning. (A hedged sketch of such a multi-exit training step appears after this table.)
Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing the code for the work described, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | To demonstrate the effectiveness of the proposed training approach, we conduct extensive experiments on two representative image classification datasets, CIFAR100 (Krizhevsky, Nair, and Hinton 2009) and ILSVRC 2012 ImageNet (Russakovsky et al. 2015).
Dataset Splits | Yes | For CIFAR100, the model is trained with a batch size of 128 for 300 epochs. The learning rate is initialized as 0.1 and divided by 10 after epochs 150 and 225. For ImageNet, we use a larger batch size of 1024 by default instead of 256, following the common setting suggested by (Goyal et al. 2017). ... The CIFAR100 dataset contains RGB images of size 32×32, with 50,000 training images over 100 classes and 10,000 test images. Following (Huang et al. 2017b), we hold out 5,000 training images as a validation set for searching the confidence threshold in budgeted batch classification and for supporting the loss-weight updating of the HDKD method. ... The ImageNet dataset contains 1,000 classes, with 1.2 million training images and 50,000 for testing. 50,000 images in the training set are held out and serve as the validation set. (A sketch of the CIFAR100 hold-out split appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using the "TensorFlow (Abadi et al. 2016) framework" but does not specify a version number or any other software dependencies with their versions.
Experiment Setup | Yes | Optimization and Hyper-parameters: We train all models from random initialization using SGD with a momentum of 0.9 and a weight decay of 10^-4. For CIFAR100, the model is trained with a batch size of 128 for 300 epochs. The learning rate is initialized as 0.1 and divided by 10 after epochs 150 and 225. For ImageNet, we use a larger batch size of 1024 by default instead of 256, following the common setting suggested by (Goyal et al. 2017). (A sketch of this optimizer and learning-rate schedule appears after this table.)
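
The Pseudocode row above points to Algorithm 1, the paper's harmonized dense knowledge distillation (HDKD) training procedure. The snippet below is only a minimal, hypothetical sketch of a generic multi-exit training step with dense knowledge distillation in TensorFlow, in which every earlier exit is distilled from every deeper exit. The model interface, the temperature, and the placeholder weights `alpha` and `beta` are assumptions; the harmonized loss-weight update that constitutes Algorithm 1 (which, per the Dataset Splits row, relies on the held-out validation set) is deliberately not reproduced here.

```python
# Hypothetical sketch: one training step for a multi-exit network with dense
# knowledge distillation (every earlier exit learns from every deeper exit).
# The harmonized loss-weight update of Algorithm 1 is NOT reproduced here;
# `alpha` and `beta` are fixed placeholder weights.
import tensorflow as tf

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
kld = tf.keras.losses.KLDivergence()

def dense_kd_loss(exit_logits, labels, alpha, beta, temperature=3.0):
    """exit_logits: list of [batch, num_classes] logits, ordered shallow -> deep."""
    total = 0.0
    for i, logits_i in enumerate(exit_logits):
        # Hard-label supervision for every exit.
        total += alpha[i] * ce(labels, logits_i)
        # Dense distillation: exit i mimics every deeper exit j > i.
        for j in range(i + 1, len(exit_logits)):
            teacher = tf.nn.softmax(tf.stop_gradient(exit_logits[j]) / temperature)
            student = tf.nn.softmax(logits_i / temperature)
            total += beta[i][j] * (temperature ** 2) * kld(teacher, student)
    return total

def train_step(model, optimizer, images, labels, alpha, beta):
    with tf.GradientTape() as tape:
        # Assumes the multi-exit model returns a list of logits, one per exit.
        exit_logits = model(images, training=True)
        loss = dense_kd_loss(exit_logits, labels, alpha, beta)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```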
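The Dataset Splits row quotes the protocol of holding out 5,000 of the 50,000 CIFAR100 training images as a validation set. A minimal sketch of that hold-out, assuming `tf.keras.datasets.cifar100` and a simple random selection (the paper does not state the exact selection rule), is:

```python
# Hypothetical hold-out split for CIFAR100: 45,000 train / 5,000 validation,
# mirroring the "hold out 5,000 training images" protocol quoted above.
# Random (rather than per-class) selection and the seed are assumptions.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()

rng = np.random.default_rng(seed=0)           # arbitrary seed
perm = rng.permutation(len(x_train))          # 50,000 shuffled indices
val_idx, train_idx = perm[:5000], perm[5000:]

x_val, y_val = x_train[val_idx], y_train[val_idx]
x_train, y_train = x_train[train_idx], y_train[train_idx]

print(x_train.shape, x_val.shape, x_test.shape)
# (45000, 32, 32, 3) (5000, 32, 32, 3) (10000, 32, 32, 3)
```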
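The Experiment Setup row lists, for CIFAR100, SGD with momentum 0.9, weight decay 10^-4, batch size 128, 300 epochs, and a learning rate of 0.1 divided by 10 after epochs 150 and 225. A sketch of that configuration with tf.keras follows; since the paper's TensorFlow version is unspecified, weight decay is approximated here with an L2 kernel regularizer, which is an assumption about the implementation rather than the authors' code.

```python
# Hypothetical CIFAR100 optimizer setup: SGD(momentum=0.9), lr 0.1 -> 0.01 -> 0.001
# at epochs 150 and 225, batch size 128, 300 epochs (values quoted above).
import tensorflow as tf

BATCH_SIZE = 128
EPOCHS = 300
STEPS_PER_EPOCH = 45000 // BATCH_SIZE  # 45,000 training images after the 5,000 hold-out

# PiecewiseConstantDecay works in optimizer steps, so epoch boundaries are
# converted to step counts.
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[150 * STEPS_PER_EPOCH, 225 * STEPS_PER_EPOCH],
    values=[0.1, 0.01, 0.001],
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# The reported weight decay of 1e-4 can be approximated by attaching an L2
# kernel regularizer to each layer (an assumption, not the authors' stated method).
l2_reg = tf.keras.regularizers.l2(1e-4)
```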