Does Knowledge Distillation Really Work?

Authors: Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew G. Wilson

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Figure 1 we show that with modern architectures knowledge distillation can lead to students with very different predictions from their teachers, even when the student has the capacity to perfectly match the teacher. Indeed, it is becoming well known that in self-distillation the student fails to match the teacher and, paradoxically, student generalization improves as a result [14, 40]. However, when the teacher is a large model (e.g. a deep ensemble), improvements in fidelity translate into improvements in generalization, as we show in Figure 1(b). For these large models there is still a significant accuracy gap between student and teacher, so fidelity is aligned with generalization. (A sketch of the fidelity metrics appears after the table.)
Researcher Affiliation | Collaboration | Samuel Stanton, Pavel Izmailov, Polina Kirichenko, and Andrew Gordon Wilson (New York University); Alexander A. Alemi (Google Research). This research is supported by an Amazon Research Award, NSF I-DISRE 193471, NIH R01DA048764-01A1, NSF IIS-1910266, and NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. Samuel Stanton is also supported by a United States Department of Defense NDSEG fellowship.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for all experiments can be found here: https://github.com/samuelstanton/gnosis.
Open Datasets | Yes | We train the teacher on a random subset of 200 examples from the MNIST training set for 100 epochs... We then distill the teacher using the full MNIST train dataset with 60,000 examples, as well as 25%, 50%, and 100% of the EMNIST train dataset [11]... distilling a ResNet-56 teacher trained on CIFAR-100... We provide similar results in Section C.3 for ImageNet, showing that our findings apply to datasets of larger scale and complexity. (A data-loading sketch appears after the table.)
Dataset Splits | No | The paper refers to a 'test set' and a 'training set' (the distillation data), but does not provide explicit details about a separate validation split (e.g., specific percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software such as PyTorch [43, 44], NumPy [19], and Torchvision [36], but does not provide specific version numbers for these components.
Experiment Setup | Yes | Since we focus on distillation fidelity, we choose α = 0 for all experiments in the main text to avoid any confounding from true labels... By default, in our experiments we use stochastic gradient descent (SGD) with momentum, train the student for 300 epochs, and use a weight decay value of 10^-4. In Figure 6 we report the results for the SGD and Adam [27] optimizers run for 1k and 5k epochs without weight decay. (A training-step sketch appears after the table.)
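
The Research Type row above centers on distillation fidelity, i.e., how closely the student's predictions match the teacher's. As a rough illustration only (not the authors' gnosis code), the sketch below computes two common fidelity measures, top-1 agreement and average predictive KL, assuming hypothetical PyTorch `student`, `teacher`, and `loader` objects.

```python
# A rough illustration (not the authors' code) of two fidelity measures:
# top-1 agreement and average predictive KL between teacher and student.
# `student`, `teacher`, and `loader` are hypothetical placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def fidelity_metrics(student, teacher, loader, device="cpu"):
    student.eval()
    teacher.eval()
    agree, total, kl_sum = 0, 0, 0.0
    for x, _ in loader:  # true labels are not needed to measure fidelity
        x = x.to(device)
        s_logits = student(x)
        t_logits = teacher(x)
        # Top-1 agreement: fraction of inputs on which the student and the
        # teacher predict the same class.
        agree += (s_logits.argmax(dim=-1) == t_logits.argmax(dim=-1)).sum().item()
        # KL(teacher || student), summed over the batch.
        kl_sum += F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="sum",
        ).item()
        total += x.size(0)
    return agree / total, kl_sum / total
```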
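
The Open Datasets row mentions a 200-example MNIST subset for the teacher and MNIST/EMNIST fractions as distillation data. A minimal Torchvision sketch of assembling such sets follows; the EMNIST split ("digits"), the batch size, and the bare ToTensor transform are assumptions, not details taken from the paper.

```python
# A minimal sketch of assembling the MNIST/EMNIST data described above.
# The EMNIST split, batch size, and transform are assumed, not from the paper.
import torch
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# Teacher data: a random subset of 200 MNIST training examples.
mnist_train = datasets.MNIST("data", train=True, download=True, transform=transform)
teacher_idx = torch.randperm(len(mnist_train))[:200].tolist()
teacher_loader = DataLoader(Subset(mnist_train, teacher_idx),
                            batch_size=128, shuffle=True)

# Distillation data: the full 60,000-example MNIST training set, or a
# 25% / 50% / 100% fraction of the EMNIST training set.
emnist_train = datasets.EMNIST("data", split="digits", train=True,
                               download=True, transform=transform)
frac = 0.25  # assumed example; the paper sweeps 25%, 50%, and 100%
distill_idx = torch.randperm(len(emnist_train))[: int(frac * len(emnist_train))].tolist()
distill_loader = DataLoader(Subset(emnist_train, distill_idx),
                            batch_size=128, shuffle=True)
```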
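
The Experiment Setup row describes pure distillation (zero weight on the true labels) trained with SGD with momentum for 300 epochs and weight decay 10^-4. The sketch below shows one way such a training step could look in PyTorch; the temperature, learning rate, and momentum values are assumptions, and the authors' actual implementation lives in the gnosis repository linked above.

```python
# A sketch of a pure-distillation training step consistent with the setup
# quoted above (no true-label term, SGD with momentum, weight decay 1e-4,
# 300 epochs). Temperature, learning rate, and momentum are assumed values.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, temperature=4.0):
    """One update using only the KD term (no true-label loss)."""
    student.train()
    with torch.no_grad():
        t_probs = F.softmax(teacher(x) / temperature, dim=-1)
    s_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    # Temperature-scaled KL distillation loss; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    loss = (temperature ** 2) * F.kl_div(s_log_probs, t_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (learning rate and momentum assumed):
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
#                             momentum=0.9, weight_decay=1e-4)
# for epoch in range(300):
#     for x, _ in distill_loader:
#         distillation_step(student, teacher, x, optimizer)
```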