Analyzing the Confidentiality of Undistillable Teachers in Knowledge Distillation

Authors: Souvik Kundu, Qirui Sun, Yao Fu, Massoud Pedram, Peter Beerel

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experimental Results: We conducted extensive experiments using both standard KD with available training data on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and data-free KD on the CIFAR-10 test set. Experimental results show that compared to normal ones, skeptical students exhibit improved performance of up to 59.5% and 5.8% for data-available and data-free KD, respectively, when distilled from nasty teachers.
Researcher Affiliation | Academia | Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089; {souvikku, qiruisun, yaof, pedram, pabeerel}@usc.edu
Pseudocode | No | The paper describes its methods using textual descriptions and mathematical equations, but it does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We have open-sourced the code at github.com/ksouvik52/Skeptical2021.
Open Datasets | Yes | We conduct extensive experiments using both standard KD with available training data on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and data-free KD on the CIFAR-10 test set. [12] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [8] Lucas Hansen. Tiny ImageNet challenge submission. CS 231N, 2015.
Dataset Splits | No | The paper states it uses the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets and mentions data-free scenarios, but it does not explicitly provide specific percentages or counts for training, validation, and test dataset splits.
Hardware Specification | Yes | We used the PyTorch API to define and train our models on an Nvidia RTX 2080 Ti GPU.
Software Dependencies | No | The paper mentions using the 'PyTorch API' but does not specify a version number for it or any other software dependencies, which is required for reproducibility.
Experiment Setup | Yes | Training hyperparameters: We used standard data augmentation techniques (horizontal flip and random crop with reflective padding) and the SGD optimizer for all training. To create a nasty teacher, we first trained a network ΦA for 160 epochs on CIFAR-10 and 200 epochs for CIFAR-100 and Tiny-ImageNet, with an initial learning rate (LR) of 0.1 for all. For CIFAR-10, we reduced the LR by a factor of 0.1 after 80 and 120 epochs. For CIFAR-100 and Tiny-ImageNet the LR decayed at 60, 120, and 160 epochs by a factor of 0.2. We chose αN as 0.04, 0.005, and 0.005 for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively [18]. Similar to [18], we chose τN to be 4, 20, and 20 for the three datasets. For the distillation training of ΦS (both normal and skeptical), we trained for 180 epochs with a starting LR of 0.05 that decays by a factor of 0.1 after 120, 150, and 170 epochs. Unless stated otherwise, we kept τ the same as τN and chose α and β to be 0.9 and 0.7, respectively. We placed the skeptical student's auxiliary classifiers after the 2nd (ΦS for KD from the teacher) and 3rd (for SD) BB of a total of 4 ResNet blocks. To give equal weight to the loss components of Eq. 3, we chose γ1 = γ2 = γ3 = 1.0 for all the experiments. We performed all the experiments with two different seeds and report the average accuracy with standard deviation (in brackets) in the tables.
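
The quoted student-distillation schedule maps onto a standard PyTorch training loop. Below is a minimal sketch, not the authors' implementation: it assumes a vanilla Hinton-style KD objective rather than the paper's full Eq. 3 with auxiliary classifiers and skeptical-student terms, and the ResNet-18 stand-ins, momentum, and weight decay are assumptions. The 180-epoch budget, LR milestones (120/150/170, decay 0.1), τ = 20, and α = 0.9 follow the quoted CIFAR-100 / Tiny-ImageNet settings.

```python
# Hedged sketch of the quoted student-distillation schedule with a standard
# Hinton-style KD loss. NOT the paper's Eq. 3; models and momentum/weight-decay
# values below are assumptions for illustration only.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def kd_loss(student_logits, teacher_logits, labels, tau=20.0, alpha=0.9):
    """alpha-weighted soft-target KL (scaled by tau^2) + (1 - alpha) hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

teacher = resnet18(num_classes=100).eval()   # stand-in for the (possibly nasty) teacher ΦA
student = resnet18(num_classes=100)          # stand-in for the student ΦS

# Quoted schedule: 180 epochs, SGD, initial LR 0.05, decayed by 0.1x at 120/150/170.
optimizer = torch.optim.SGD(student.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=5e-4)  # momentum/WD assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[120, 150, 170], gamma=0.1)

def train_one_epoch(loader):
    for images, labels in loader:
        with torch.no_grad():
            teacher_logits = teacher(images)          # frozen teacher forward pass
        loss = kd_loss(student(images), teacher_logits, labels, tau=20.0, alpha=0.9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Example driver (train_loader would be a CIFAR-100 / Tiny-ImageNet DataLoader
# with horizontal flip and random crop with reflective padding, as quoted):
# for epoch in range(180):
#     train_one_epoch(train_loader)
#     scheduler.step()
```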