Does Knowledge Distillation Really Work?

Authors: Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew G. Wilson

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Figure 1 we show that with modern architectures knowledge distillation can lead to students with very different predictions from their teachers, even when the student has the capacity to perfectly match the teacher. Indeed, it is becoming well known that in self-distillation the student fails to match the teacher and, paradoxically, student generalization improves as a result [14, 40]. However, when the teacher is a large model (e.g. a deep ensemble), improvements in fidelity translate into improvements in generalization, as we show in Figure 1(b). For these large models there is still a significant accuracy gap between student and teacher, so fidelity is aligned with generalization. (A sketch of the fidelity metrics appears after the table.)
Researcher Affiliation | Collaboration | Samuel Stanton, Pavel Izmailov, Polina Kirichenko, and Andrew Gordon Wilson (New York University); Alexander A. Alemi (Google Research). This research is supported by an Amazon Research Award, NSF I-DISRE 193471, NIH R01DA048764-01A1, NSF IIS-1910266, and NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. Samuel Stanton is also supported by a United States Department of Defense NDSEG fellowship.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for all experiments can be found here: https://github.com/samuelstanton/gnosis.
Open Datasets | Yes | We train the teacher on a random subset of 200 examples from the MNIST training set for 100 epochs... We then distill the teacher using the full MNIST train dataset with 60,000 examples, as well as 25%, 50%, and 100% of the EMNIST train dataset [11]... distilling a ResNet-56 teacher trained on CIFAR-100... We provide similar results in Section C.3 for ImageNet, showing that our findings apply to datasets of larger scale and complexity. (A data-loading sketch appears after the table.)
Dataset Splits | No | The paper refers to a 'test set' and a 'training set' (the distillation data), but does not provide explicit details about a separate validation split (e.g., specific percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software such as PyTorch [43, 44], NumPy [19], and Torchvision [36], but does not provide specific version numbers for these components.
Experiment Setup | Yes | Since we focus on distillation fidelity, we choose α = 0 for all experiments in the main text to avoid any confounding from true labels... By default, in our experiments we use stochastic gradient descent (SGD) with momentum, train the student for 300 epochs, and use a weight decay value of 10^-4. In Figure 6 we report the results for the SGD and Adam [27] optimizers run for 1k and 5k epochs without weight decay. (A training-step sketch appears after the table.)
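
The Research Type row above centers on distillation fidelity, i.e., how closely the student's predictions match the teacher's. As a rough illustration only (not the authors' gnosis code), the sketch below computes two common fidelity measures, top-1 agreement and average predictive KL, assuming hypothetical PyTorch `student`, `teacher`, and `loader` objects.

```python
# A rough illustration (not the authors' code) of two fidelity measures:
# top-1 agreement and average predictive KL between teacher and student.
# `student`, `teacher`, and `loader` are hypothetical placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def fidelity_metrics(student, teacher, loader, device="cpu"):
    student.eval()
    teacher.eval()
    agree, total, kl_sum = 0, 0, 0.0
    for x, _ in loader:  # true labels are not needed to measure fidelity
        x = x.to(device)
        s_logits = student(x)
        t_logits = teacher(x)
        # Top-1 agreement: fraction of inputs on which the student and the
        # teacher predict the same class.
        agree += (s_logits.argmax(dim=-1) == t_logits.argmax(dim=-1)).sum().item()
        # KL(teacher || student), summed over the batch.
        kl_sum += F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="sum",
        ).item()
        total += x.size(0)
    return agree / total, kl_sum / total
```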
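
The Open Datasets row mentions a 200-example MNIST subset for the teacher and MNIST/EMNIST fractions as distillation data. A minimal Torchvision sketch of assembling such sets follows; the EMNIST split ("digits"), the batch size, and the bare ToTensor transform are assumptions, not details taken from the paper.

```python
# A minimal sketch of assembling the MNIST/EMNIST data described above.
# The EMNIST split, batch size, and transform are assumed, not from the paper.
import torch
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# Teacher data: a random subset of 200 MNIST training examples.
mnist_train = datasets.MNIST("data", train=True, download=True, transform=transform)
teacher_idx = torch.randperm(len(mnist_train))[:200].tolist()
teacher_loader = DataLoader(Subset(mnist_train, teacher_idx),
                            batch_size=128, shuffle=True)

# Distillation data: the full 60,000-example MNIST training set, or a
# 25% / 50% / 100% fraction of the EMNIST training set.
emnist_train = datasets.EMNIST("data", split="digits", train=True,
                               download=True, transform=transform)
frac = 0.25  # assumed example; the paper sweeps 25%, 50%, and 100%
distill_idx = torch.randperm(len(emnist_train))[: int(frac * len(emnist_train))].tolist()
distill_loader = DataLoader(Subset(emnist_train, distill_idx),
                            batch_size=128, shuffle=True)
```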
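
The Experiment Setup row describes pure distillation (zero weight on the true labels) trained with SGD with momentum for 300 epochs and weight decay 10^-4. The sketch below shows one way such a training step could look in PyTorch; the temperature, learning rate, and momentum values are assumptions, and the authors' actual implementation lives in the gnosis repository linked above.

```python
# A sketch of a pure-distillation training step consistent with the setup
# quoted above (no true-label term, SGD with momentum, weight decay 1e-4,
# 300 epochs). Temperature, learning rate, and momentum are assumed values.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, temperature=4.0):
    """One update using only the KD term (no true-label loss)."""
    student.train()
    with torch.no_grad():
        t_probs = F.softmax(teacher(x) / temperature, dim=-1)
    s_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    # Temperature-scaled KL distillation loss; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    loss = (temperature ** 2) * F.kl_div(s_log_probs, t_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (learning rate and momentum assumed):
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
#                             momentum=0.9, weight_decay=1e-4)
# for epoch in range(300):
#     for x, _ in distill_loader:
#         distillation_step(student, teacher, x, optimizer)
```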