Analyzing the Confidentiality of Undistillable Teachers in Knowledge Distillation
Authors: Souvik Kundu, Qirui Sun, Yao Fu, Massoud Pedram, Peter Beerel
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experimental Results: We conducted extensive experiments using both standard KD with available training data on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and data-free KD on the CIFAR-10 test set. Experimental results show that compared to normal ones, skeptical students exhibit improved performance of up to 59.5% and 5.8% for data-available and data-free KD, respectively, when distilled from nasty teachers. |
| Researcher Affiliation | Academia | Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089; {souvikku, qiruisun, yaof, pedram, pabeerel}@usc.edu |
| Pseudocode | No | The paper describes its methods using textual descriptions and mathematical equations, but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We have open-sourced the code at github.com/ksouvik52/Skeptical2021. |
| Open Datasets | Yes | We conduct extensive experiments using both standard KD with available training data on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and data-free KD on the CIFAR-10 test set. [12] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [8] Lucas Hansen. Tiny ImageNet challenge submission. CS 231N, 2015. |
| Dataset Splits | No | The paper states it uses the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets and mentions data-free scenarios, but it does not explicitly provide specific percentages or counts for training, validation, and test dataset splits. |
| Hardware Specification | Yes | We used PyTorch API to define and train our models on an Nvidia RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions using the 'PyTorch API' but does not specify a version number for it or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Training hyperparameters. We used standard data augmentation techniques (horizontal flip and random crop with reflective padding) and the SGD optimizer for all training. To create a nasty teacher, we first trained a network ΦA for 160 epochs on CIFAR-10 and 200 epochs for CIFAR-100 and Tiny-ImageNet with an initial learning rate (LR) of 0.1 for all. For CIFAR-10, we reduced the LR by a factor of 0.1 after 80 and 120 epochs. For CIFAR-100 and Tiny-ImageNet, the LR decayed at 60, 120, and 160 epochs by a factor of 0.2. We chose αN as 0.04, 0.005, and 0.005 for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively [18]. Similar to [18], we chose τN to be 4, 20, and 20 for the three datasets. For the distillation training to ΦS (both normal and skeptical), we trained for 180 epochs with a starting LR of 0.05 that decays by a factor of 0.1 after 120, 150, and 170 epochs. Unless stated otherwise, we kept τ the same as τN and chose α and β to be 0.9 and 0.7, respectively. We placed the skeptical student's auxiliary classifiers after the 2nd (ΦS for KD from the teacher) and 3rd (for SD) BB of a total of 4 ResNet blocks. To give equal weight to the loss components of Eq. 3, we chose γ1 = γ2 = γ3 = 1.0 for all the experiments. We performed all the experiments with two different seeds and report the average accuracy with standard deviation (in brackets) in the tables. |
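
The Research Type and Experiment Setup rows quote a distillation setup built around a temperature τ and weighting factors α and β. The paper's full objective (Eq. 3, with auxiliary-classifier and self-distillation terms) is not reproduced in the quotes above, so the sketch below only shows the standard Hinton-style soft/hard KD combination that such a setup builds on. It is a minimal illustration, not the authors' code; the function name and the default tau/alpha values (borrowed from the CIFAR-10 settings quoted above) are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, tau=4.0, alpha=0.9):
    # Soft term: KL divergence between temperature-softened distributions.
    # The tau**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random logits for a 10-class problem:
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = kd_loss(s, t, y)
```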
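
The training hyperparameters quoted in the Experiment Setup row map naturally onto a PyTorch SGD plus MultiStepLR configuration. The sketch below covers only the student distillation schedule that is quoted (180 epochs, initial LR 0.05, decay by 0.1 after epochs 120, 150, and 170); `student` is a placeholder for whatever ResNet student is being trained, and the momentum and weight-decay values are common CIFAR defaults that the quote does not specify, so they should be read as assumptions.

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model; in the paper this would be a CIFAR-style ResNet student.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Quoted schedule for distillation training: 180 epochs, initial LR 0.05,
# decayed by a factor of 0.1 after epochs 120, 150, and 170.
# Momentum and weight decay are assumed, not stated in the quote.
optimizer = optim.SGD(student.parameters(), lr=0.05,
                      momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[120, 150, 170], gamma=0.1)

for epoch in range(180):
    # ... one epoch of distillation training over the chosen dataset ...
    scheduler.step()
```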