Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Authors: Zi Wang (pp. 10245-10253)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Several student networks trained with these synthesized transfer sets present competitive performance compared to the networks trained with the original training set and other data-free KD approaches. The proposed approach is evaluated with various benchmark network architectures and datasets and exhibits clear improvement over existing works. Specifically, our student networks trained with the proposed approach achieve 99.08% and 93.31% accuracies without using any original training samples by transferring the knowledge from the teacher networks pre-trained on the MNIST and CIFAR10 datasets."
Researcher Affiliation | Academia | Zi Wang, Department of Electrical Engineering and Computer Science, The University of Tennessee, zwang84@vols.utk.edu
Pseudocode | Yes | "Algorithm 1: Data-free knowledge distillation for compact student model training" (a hedged sketch of this procedure is given after the table)
Open Source Code | No | The paper does not provide any statement or link regarding the public availability of source code for the described methodology.
Open Datasets | Yes | "MNIST is a handwritten digits dataset, which contains 60,000 training samples and 10,000 test samples..." (LeCun et al. 1998) and "The CIFAR10 dataset consists of 50,000 training samples and 10,000 test samples..." (Krizhevsky, Hinton et al. 2009)
Dataset Splits | No | The paper mentions "60,000 training samples and 10,000 test samples" for MNIST and "50,000 training samples and 10,000 test samples" for CIFAR-10 but does not explicitly describe a validation split or its size.
Hardware Specification | Yes | "All the experiments are implemented with TensorFlow (Abadi et al. 2016) on an NVIDIA GeForce RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz."
Software Dependencies | No | The paper states "All the experiments are implemented with TensorFlow (Abadi et al. 2016)" but does not specify a version number for TensorFlow or other software dependencies.
Experiment Setup | Yes | "For each configuration, we first train the teacher model with the cross-entropy loss using a stochastic gradient descent (SGD) optimizer with a batch size of 512 for 200 epochs. The initial learning rate is 0.1, which is divided by 10 at epoch 50, 100, and 150, respectively. We optimize the noise inputs by minimizing the KL-divergence between their corresponding softmax outputs and the generated labels with an Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.001 for 1500 iterations. For the activation loss scaling factor λa, we implement a hyperparameter search and report the best performance with λa = 0.05, 0.05, 0.1 for the LeNet-5, AlexNet, and ResNet experiments, respectively. ... A temperature (τ) of 20 is used for both of the sample synthesis and student model training across all architectures." (a hedged TensorFlow sketch of these settings follows the table)
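Below is a minimal TensorFlow sketch of the transfer-set synthesis step referenced by Algorithm 1 and the quoted setup: random noise inputs are optimized with Adam (learning rate 0.001, 1500 iterations) so that the teacher's temperature-softened outputs (τ = 20) match generated soft target labels under a KL-divergence loss. The helper name `synthesize_batch`, the way `soft_targets` is produced, and the activation-regularization term weighted by λa are assumptions for illustration; the paper's exact label-generation scheme and activation loss are not reproduced here.

```python
import tensorflow as tf

def synthesize_batch(teacher, soft_targets, input_shape,
                     tau=20.0, lambda_a=0.05, steps=1500, lr=1e-3):
    """Optimize noise inputs so the teacher's softened predictions match
    the sampled soft targets (KL divergence), per the quoted setup."""
    # Start from Gaussian noise images; these are the trainable variables.
    x = tf.Variable(tf.random.normal([soft_targets.shape[0], *input_shape]))
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    kl = tf.keras.losses.KLDivergence()

    for _ in range(steps):
        with tf.GradientTape() as tape:
            logits = teacher(x, training=False)      # teacher stays frozen
            probs = tf.nn.softmax(logits / tau)      # temperature-softened outputs
            loss = kl(soft_targets, probs)
            # Placeholder activation term weighted by lambda_a; the paper's
            # exact activation loss is not given in the quoted text.
            loss += lambda_a * (-tf.reduce_mean(tf.abs(logits)))
        grads = tape.gradient(loss, [x])
        opt.apply_gradients(zip(grads, [x]))
    return tf.convert_to_tensor(x)
```

Repeating this over many batches of sampled soft targets would yield the synthesized transfer set on which the student is then trained.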
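A companion sketch of the quoted training schedules follows: teacher pre-training with SGD (batch size 512, 200 epochs, initial learning rate 0.1 divided by 10 at epochs 50, 100, and 150), then student training on the synthesized transfer set by matching the teacher's τ = 20 softened outputs. The student's optimizer, the epoch count, and the label format assumed by the cross-entropy loss are not specified in the quoted text and are placeholders here.

```python
import tensorflow as tf

def train_teacher(teacher, train_ds, steps_per_epoch, epochs=200):
    """Teacher pre-training: SGD, batch size 512, LR 0.1 divided by 10 at epochs 50, 100, 150."""
    # steps_per_epoch = number of optimizer steps per epoch (training-set size / 512).
    boundaries = [50 * steps_per_epoch, 100 * steps_per_epoch, 150 * steps_per_epoch]
    schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=boundaries, values=[0.1, 0.01, 0.001, 0.0001])
    teacher.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
        # Assumes integer class labels in `train_ds`; the paper only says "cross-entropy loss".
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    teacher.fit(train_ds, epochs=epochs)

def train_student(student, teacher, transfer_x, tau=20.0, epochs=100, batch_size=512):
    """Distill on the synthesized transfer set by matching softened teacher outputs."""
    kl = tf.keras.losses.KLDivergence()
    opt = tf.keras.optimizers.Adam()            # student optimizer is an assumption
    ds = tf.data.Dataset.from_tensor_slices(transfer_x).batch(batch_size)
    for _ in range(epochs):                     # epoch count is an assumption
        for x in ds:
            t_probs = tf.nn.softmax(teacher(x, training=False) / tau)
            with tf.GradientTape() as tape:
                s_probs = tf.nn.softmax(student(x, training=True) / tau)
                loss = kl(t_probs, s_probs)
            grads = tape.gradient(loss, student.trainable_variables)
            opt.apply_gradients(zip(grads, student.trainable_variables))
```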