Data-Distortion Guided Self-Distillation for Deep Neural Networks

Authors: Ting-Bing Xu, Cheng-Lin Liu (pp. 5565-5572)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple datasets (i.e., CIFAR-10/100 and ImageNet) demonstrate that the proposed method can effectively improve the generalization performance of various network architectures (such as AlexNet, ResNet, Wide ResNet, and DenseNet), outperforming existing distillation methods with little extra training effort.
Researcher Affiliation | Academia | (1) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (3) CAS Center for Excellence of Brain Science and Intelligence Technology, Beijing, China
Pseudocode | No | The paper describes the training process in text and diagrams (Figure 2) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Tongcheng/caffe/
Open Datasets | Yes | We follow state-of-the-art networks on the CIFAR datasets (Krizhevsky and Hinton 2009). CIFAR-10 consists of 32x32 colour images in 10 classes, including 50000 training samples and 10000 test samples... Additionally, we also follow the classical AlexNet (Krizhevsky, Sutskever, and Hinton 2012) and ResNet-18 (He et al. 2016a) networks on the large-scale ImageNet-2012 dataset (Russakovsky et al. 2015)...
Dataset Splits | Yes | CIFAR-10 consists of ... 50000 training samples and 10000 test samples... CIFAR-100 is a more challenging recognition task... its training and test sets are also 50000 and 10000 colored natural scene images... The ImageNet-2012 dataset... contains about 1.3 million training images and 50000 validation images from 1000 classes.
Hardware Specification | No | The paper states 'All the experiments are performed with the high-efficiency caffe (Jia et al. 2014) platform' but does not provide specific hardware details such as GPU or CPU models.
Software Dependencies | No | The paper mentions using 'caffe' but does not specify a version number for it or other software dependencies.
Experiment Setup | Yes | Specifically, we perform training processes of ResNet-32/110..., which set the mini-batch size to 64, weight decay to 10^-4, and initial learning rate to 0.1 (dropped by 0.1 every 60 epochs and trained for 200 epochs). For Wide ResNet..., we set 128 samples per mini-batch, weight decay to 5x10^-4, and initial learning rate to 0.1 (dropped by 0.2 after 60, 120, and 180 epochs and trained for 200 epochs). For DenseNet..., we use a weight decay of 10^-4, a mini-batch size of 64 for 300 epochs, and an initial learning rate of 0.1 (divided by 10 at 50% and 75% of the total number of training epochs). All the networks are trained using Nesterov momentum (Sutskever et al. 2013) of 0.9 and 'msra' weight initialization (He et al. 2015).
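
For reference, the splits quoted in the Open Datasets and Dataset Splits rows are the standard, fixed partitions that ship with these benchmarks (50000 training / 10000 test images for CIFAR-10 and CIFAR-100; roughly 1.3 million training and 50000 validation images for ImageNet-2012). The sketch below loads the CIFAR splits via torchvision purely for illustration; this is an assumption on our part (the authors' pipeline used Caffe data layers), and the `./data` root path is a placeholder.

```python
# Illustrative only: the paper's experiments ran in Caffe, so torchvision here is
# an assumed stand-in to show the fixed train/test splits quoted in the review table.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR-10: fixed 50000-train / 10000-test split.
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar10_test = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)
print(len(cifar10_train), len(cifar10_test))  # 50000 10000

# CIFAR-100: same 50000 / 10000 split over 100 classes.
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)
cifar100_test = datasets.CIFAR100(root="./data", train=False, download=True, transform=to_tensor)
print(len(cifar100_train), len(cifar100_test))  # 50000 10000
```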
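
The hyperparameters quoted in the Experiment Setup row can also be summarized as a training configuration. Below is a minimal PyTorch sketch under the assumption that a PyTorch translation of the Caffe setup is acceptable; the placeholder model and the helper names `configure_training` and `msra_init` are illustrative, not taken from the authors' code. The Wide ResNet and DenseNet settings differ only in the commented values (batch size, weight decay, schedule, total epochs).

```python
# Minimal sketch of the ResNet-32/110 CIFAR schedule quoted above: batch size 64,
# weight decay 1e-4, Nesterov momentum 0.9, lr 0.1 dropped by 0.1 every 60 epochs,
# 200 epochs total, 'msra' (Kaiming) weight initialization.
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

EPOCHS = 200      # 300 for the DenseNet setting
BATCH_SIZE = 64   # 128 for the Wide ResNet setting


def msra_init(m: nn.Module) -> None:
    # 'msra' initialization of He et al. (2015).
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)


def configure_training(model: nn.Module):
    model.apply(msra_init)
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
                    weight_decay=1e-4,   # 5e-4 for the Wide ResNet setting
                    nesterov=True)
    # lr x0.1 every 60 epochs for ResNet-32/110; Wide ResNet instead uses gamma=0.2
    # at 60/120/180, and DenseNet divides by 10 at 50% and 75% of 300 epochs (150, 225).
    scheduler = MultiStepLR(optimizer, milestones=[60, 120, 180], gamma=0.1)
    return optimizer, scheduler


# Usage with any CIFAR ResNet-32/110 implementation (tiny placeholder model here):
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer, scheduler = configure_training(model)
```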