Harmonized Dense Knowledge Distillation Training for Multi-Exit Architectures
Authors: Xinglu Wang, Yingming Li
AAAI 2021, pp. 10218-10226 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on CIFAR100 and ImageNet show that the HDKD strategy harmoniously improves the performance of the state-of-the-art multi-exit neural networks. |
| Researcher Affiliation | Academia | Xinglu Wang, Yingming Li College of Information Science & Electronic Engineering, Zhejiang University, China {xingluwang,yingming}@zju.edu.cn |
| Pseudocode | Yes | Algorithm 1 Harmonized Dense Knowledge Distillation Training Procedure for Multi-exit Learning (a hedged sketch of a dense-distillation objective follows the table) |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing the code for the work described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | To demonstrate the effectiveness of the proposed training approach, we conduct extensive experiments on two representative image classification datasets, CIFAR100 (Krizhevsky, Nair, and Hinton 2009) and ILSVRC 2012 ImageNet (Russakovsky et al. 2015). |
| Dataset Splits | Yes | For CIFAR100, the model is trained with a batch size of 128 for 300 epochs. The learning rate is initialized as 0.1, divided by 10 after epochs 150 and 225. For ImageNet, we use a larger batch size of 1024 by default instead of 256, following the common setting suggested by (Goyal et al. 2017). ... CIFAR100 dataset contains RGB images of size 32×32, with 50,000 images of 100 classes for training and 10,000 images for testing. Following (Huang et al. 2017b), we hold out 5,000 training images as a validation set for searching the confidence threshold in budgeted batch classification and supporting the loss weights updating for the HDKD method. ... ImageNet dataset contains 1,000 classes, with 1.2 million training images and 50,000 for testing. 50,000 images in the training set are held out and serve as the validation set. (A sketch of the CIFAR100 hold-out split follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'TensorFlow (Abadi et al. 2016) framework' but does not specify the version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | Optimization and Hyper-parameters: We train all models from random initialization using SGD with a momentum of 0.9 and a weight decay of 10^-4. For CIFAR100, the model is trained with a batch size of 128 for 300 epochs. The learning rate is initialized as 0.1, divided by 10 after epochs 150 and 225. For ImageNet, we use a larger batch size of 1024 by default instead of 256, following the common setting suggested by (Goyal et al. 2017). (The step learning-rate schedule is sketched after the table.) |
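
The Pseudocode row above refers to the paper's Algorithm 1, which defines the full HDKD procedure, including how the per-exit loss weights are updated on the held-out validation set. Since no code is released, the snippet below is only a minimal NumPy sketch of a dense-distillation objective for a multi-exit network, in which every exit receives a hard-label loss plus distillation from all deeper exits, combined with per-exit weights. The function names, temperature, and weight handling are illustrative assumptions, not the authors' implementation, and the harmonized weight-update rule itself is not reproduced here.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_kd_loss(exit_logits, labels, exit_weights, T=3.0):
    """Sketch of a densely-distilled multi-exit objective: each exit i
    gets cross-entropy on the labels plus KL distillation from every
    deeper exit j > i; per-exit terms are combined with exit_weights
    (the weights HDKD's Algorithm 1 would adapt on the validation set)."""
    n_exits = len(exit_logits)
    total = 0.0
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        kd = 0.0
        for j in range(i + 1, n_exits):  # distill from all deeper exits
            teacher = softmax(exit_logits[j], T)
            student = softmax(logits, T)
            kd += (teacher * (np.log(teacher + 1e-12)
                              - np.log(student + 1e-12))).sum(-1).mean()
        total += exit_weights[i] * (ce + kd)
    return total

# Toy usage: 3 exits, batch of 4, 100 classes (CIFAR100-sized output).
logits = [np.random.randn(4, 100) for _ in range(3)]
labels = np.array([1, 7, 42, 99])
print(dense_kd_loss(logits, labels, exit_weights=[1.0, 1.0, 1.0]))
```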
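
The Dataset Splits row quotes the paper's hold-out of 5,000 of the 50,000 CIFAR100 training images as a validation set, used for searching the confidence threshold and updating the HDKD loss weights. A minimal sketch of such a split follows; the paper cites Huang et al. (2017b) but does not state the exact sampling procedure, so the random permutation, seed, and function name here are assumptions.

```python
import numpy as np

def split_cifar100_train(num_train=50_000, num_val=5_000, seed=0):
    """Hold out num_val of the num_train CIFAR100 training images as a
    validation set; returns (train_indices, val_indices). The sampling
    strategy (uniform random vs. class-balanced) is assumed here."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_train)
    return perm[num_val:], perm[:num_val]

train_idx, val_idx = split_cifar100_train()
assert len(train_idx) == 45_000 and len(val_idx) == 5_000
```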
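
The Experiment Setup row reports the CIFAR100 optimization recipe: SGD with momentum 0.9, weight decay 10^-4, batch size 128, 300 epochs, and a learning rate of 0.1 divided by 10 after epochs 150 and 225. The paper trains with TensorFlow; the plain-Python helper below merely encodes that step schedule for reference, and its name is hypothetical.

```python
def cifar100_step_lr(epoch, base_lr=0.1):
    """Step schedule from the reported CIFAR100 setup: 0.1 for epochs
    0-149, 0.01 for epochs 150-224, 0.001 for epochs 225-299."""
    if epoch < 150:
        return base_lr
    if epoch < 225:
        return base_lr / 10.0
    return base_lr / 100.0

# Remaining reported hyper-parameters: SGD momentum 0.9,
# weight decay 10^-4, batch size 128, 300 epochs total.
for epoch in (0, 150, 225):
    print(epoch, cifar100_step_lr(epoch))  # 0.1, 0.01, 0.001
```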