Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Authors: Zi Wang (pp. 10245-10253)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Several student networks trained with these synthesized transfer sets present competitive performance compared to the networks trained with the original training set and other data-free KD approaches. The proposed approach is evaluated with various benchmark network architectures and datasets and exhibits clear improvement over existing works. Specifically, our student networks trained with the proposed approach achieve 99.08% and 93.31% accuracies without using any original training samples by transferring the knowledge from the teacher networks pre-trained on the MNIST and CIFAR10 datasets."
Researcher Affiliation | Academia | Zi Wang, Department of Electrical Engineering and Computer Science, The University of Tennessee, zwang84@vols.utk.edu
Pseudocode | Yes | "Algorithm 1: Data-free knowledge distillation for compact student model training" (a hedged sketch of this procedure is given after the table)
Open Source Code | No | The paper does not provide any statement or link regarding the public availability of source code for the described methodology.
Open Datasets | Yes | "MNIST is a handwritten digits dataset, which contains 60,000 training samples and 10,000 test samples..." (LeCun et al. 1998) and "The CIFAR10 dataset consists of 50,000 training samples and 10,000 test samples..." (Krizhevsky, Hinton et al. 2009)
Dataset Splits | No | The paper mentions "60,000 training samples and 10,000 test samples" for MNIST and "50,000 training samples and 10,000 test samples" for CIFAR-10 but does not explicitly describe a validation split or its size.
Hardware Specification | Yes | "All the experiments are implemented with TensorFlow (Abadi et al. 2016) on an NVIDIA GeForce RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz."
Software Dependencies | No | The paper states "All the experiments are implemented with TensorFlow (Abadi et al. 2016)" but does not specify a version number for TensorFlow or other software dependencies.
Experiment Setup | Yes | "For each configuration, we first train the teacher model with the cross-entropy loss using a stochastic gradient descent (SGD) optimizer with a batch size of 512 for 200 epochs. The initial learning rate is 0.1, which is divided by 10 at epoch 50, 100, and 150, respectively. We optimize the noise inputs by minimizing the KL-divergence between their corresponding softmax outputs and the generated labels with an Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.001 for 1500 iterations. For the activation loss scaling factor λa, we implement a hyperparameter search and report the best performance with λa = 0.05, 0.05, 0.1 for the LeNet-5, AlexNet, and ResNet experiments, respectively. ... A temperature (τ) of 20 is used for both of the sample synthesis and student model training across all architectures." (a hedged TensorFlow sketch of these settings follows the table)
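Below is a minimal TensorFlow sketch of the transfer-set synthesis step referenced by Algorithm 1 and the quoted setup: random noise inputs are optimized with Adam (learning rate 0.001, 1500 iterations) so that the teacher's temperature-softened outputs (τ = 20) match generated soft target labels under a KL-divergence loss. The helper name `synthesize_batch`, the way `soft_targets` is produced, and the activation-regularization term weighted by λa are assumptions for illustration; the paper's exact label-generation scheme and activation loss are not reproduced here.

```python
import tensorflow as tf

def synthesize_batch(teacher, soft_targets, input_shape,
                     tau=20.0, lambda_a=0.05, steps=1500, lr=1e-3):
    """Optimize noise inputs so the teacher's softened predictions match
    the sampled soft targets (KL divergence), per the quoted setup."""
    # Start from Gaussian noise images; these are the trainable variables.
    x = tf.Variable(tf.random.normal([soft_targets.shape[0], *input_shape]))
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    kl = tf.keras.losses.KLDivergence()

    for _ in range(steps):
        with tf.GradientTape() as tape:
            logits = teacher(x, training=False)      # teacher stays frozen
            probs = tf.nn.softmax(logits / tau)      # temperature-softened outputs
            loss = kl(soft_targets, probs)
            # Placeholder activation term weighted by lambda_a; the paper's
            # exact activation loss is not given in the quoted text.
            loss += lambda_a * (-tf.reduce_mean(tf.abs(logits)))
        grads = tape.gradient(loss, [x])
        opt.apply_gradients(zip(grads, [x]))
    return tf.convert_to_tensor(x)
```

Repeating this over many batches of sampled soft targets would yield the synthesized transfer set on which the student is then trained.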
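A companion sketch of the quoted training schedules follows: teacher pre-training with SGD (batch size 512, 200 epochs, initial learning rate 0.1 divided by 10 at epochs 50, 100, and 150), then student training on the synthesized transfer set by matching the teacher's τ = 20 softened outputs. The student's optimizer, the epoch count, and the label format assumed by the cross-entropy loss are not specified in the quoted text and are placeholders here.

```python
import tensorflow as tf

def train_teacher(teacher, train_ds, steps_per_epoch, epochs=200):
    """Teacher pre-training: SGD, batch size 512, LR 0.1 divided by 10 at epochs 50, 100, 150."""
    # steps_per_epoch = number of optimizer steps per epoch (training-set size / 512).
    boundaries = [50 * steps_per_epoch, 100 * steps_per_epoch, 150 * steps_per_epoch]
    schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=boundaries, values=[0.1, 0.01, 0.001, 0.0001])
    teacher.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
        # Assumes integer class labels in `train_ds`; the paper only says "cross-entropy loss".
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    teacher.fit(train_ds, epochs=epochs)

def train_student(student, teacher, transfer_x, tau=20.0, epochs=100, batch_size=512):
    """Distill on the synthesized transfer set by matching softened teacher outputs."""
    kl = tf.keras.losses.KLDivergence()
    opt = tf.keras.optimizers.Adam()            # student optimizer is an assumption
    ds = tf.data.Dataset.from_tensor_slices(transfer_x).batch(batch_size)
    for _ in range(epochs):                     # epoch count is an assumption
        for x in ds:
            t_probs = tf.nn.softmax(teacher(x, training=False) / tau)
            with tf.GradientTape() as tape:
                s_probs = tf.nn.softmax(student(x, training=True) / tau)
                loss = kl(t_probs, s_probs)
            grads = tape.gradient(loss, student.trainable_variables)
            opt.apply_gradients(zip(grads, student.trainable_variables))
```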