Knowledge Distillation: Bad Models Can Be Good Role Models

Authors: Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 Experiments) | To this point, we saw that getting a sampler (and thus a teacher) from a noisy distribution can be more sample-efficient than getting a learner. Furthermore, we showed that we can leverage multiple independent teachers to approximate the Bayes-optimal classifier, either via ensembling at inference time or via distillation on unlabeled data. We now complement our theoretical results with an experimental evaluation, showing the benefit of using distillation when training on noisy data. While in our theoretical setting we studied teachers trained on entirely disjoint training sets, in practice we find it more effective to train the teachers on overlapping datasets, as well as on the same dataset with different random initializations. To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). We see that our teachers achieve 81.3% test accuracy (see Table 1) and behave similarly to samplers (see Figure 1), reproducing the results of [20]. We now compare the three methods considered before for using teachers to get learners: 1) test-time ensembling; 2) ensemble as distillation teacher; and 3) random-teacher distillation. For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles CIFAR-10 [21], where the labels are provided by the previously trained teachers. We report our results in Table 1, where the reported accuracies are on the CIFAR-10 test data. Observe that using an ensemble for inference reduces the noise significantly, achieving a test accuracy of 87.8% (versus 81.3% for a single teacher). When applying distillation, both random pseudo-labeling and ensemble pseudo-labeling further increase the test accuracy to about 90%. In addition, we study how the number of teachers affects performance (see Figure 2). We observe that both random pseudo-labeling and ensemble majority voting improve as the number of teachers grows.
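As a concrete illustration of method 1 above (test-time ensembling), the snippet below averages the teachers' softmax outputs on each test batch and predicts the argmax class. This is a minimal sketch under stated assumptions, not the authors' released code: the function name, the probability-averaging rule (the paper may instead use a hard majority vote), and the PyTorch data-loading interface are all assumptions.

```python
import torch

@torch.no_grad()
def ensemble_accuracy(teachers, test_loader, device="cuda"):
    # Test-time ensembling sketch: average the teachers' predicted class
    # probabilities per batch and score the argmax against the true labels.
    for t in teachers:
        t.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        probs = torch.stack([t(x).softmax(dim=1) for t in teachers]).mean(dim=0)
        correct += (probs.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```

Under this reading, the 87.8% ensemble figure would correspond to calling such a routine with all trained teachers on the CIFAR-10 test loader, versus evaluating a single teacher in the same way.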
Researcher Affiliation | Collaboration | Gal Kaplun (Harvard University & Mobileye), galkaplun@g.harvard.edu; Eran Malach (Hebrew University & Mobileye), eran.malach@mail.huji.ac.il; Preetum Nakkiran (University of California San Diego), preetum@ucsd.edu; Shai Shalev-Shwartz (Hebrew University & Mobileye), shais@cs.huji.ac.il
Pseudocode | No | The paper describes algorithms such as Ensemble-Pseudo-Labeling (EPL) and Random-Pseudo-Labeling (RPL) in narrative text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps.
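Since EPL and RPL are described only in prose, here is one plausible reading of the two pseudo-labeling rules as code. The function names, the hard-majority-vote interpretation of EPL, and the batched teacher interface are assumptions based on the excerpt above, not the paper's formal definitions.

```python
import torch

@torch.no_grad()
def teacher_predictions(teachers, x):
    # Each teacher's hard class prediction for a batch x; shape (T, B).
    return torch.stack([t(x).argmax(dim=1) for t in teachers])

def ensemble_pseudo_labels(teachers, x):
    # EPL sketch: label each unlabeled example with the majority vote
    # ("ensemble majority") over the teachers' predictions.
    return teacher_predictions(teachers, x).mode(dim=0).values

def random_pseudo_labels(teachers, x):
    # RPL sketch: label each unlabeled example with the prediction of one
    # teacher chosen uniformly at random (all teachers are evaluated here,
    # and a single prediction is then selected per example).
    preds = teacher_predictions(teachers, x)                                # (T, B)
    idx = torch.randint(len(teachers), (preds.shape[1],), device=preds.device)
    return preds[idx, torch.arange(preds.shape[1], device=preds.device)]    # (B,)
```

In this sketch, the student would be trained on CIFAR-5m images paired with the outputs of ensemble_pseudo_labels or random_pseudo_labels in place of ground-truth labels.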
Open Source Code | Yes | See repository: https://github.com/GalKaplun/sampler-distillation
Open Datasets | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles CIFAR-10 [21], where the labels are provided by the previously trained teachers.
Dataset Splits | No | The paper states training on CIFAR-10 for teachers and CIFAR-5m for the student, and reporting results on CIFAR-10 test data, but it does not explicitly specify the training, validation, and test dataset splits with percentages or sample counts for the experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using ResNet-18 but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) Appendix D: Training Details and Architectures. For all our experiments, we use a ResNet-18 [13] as our base architecture and train it for 200 epochs using SGD with momentum (0.9) and a learning rate starting at 0.1 and decreased by a factor of 10 at epochs 100 and 150. We use a batch size of 128. For both CIFAR-10 and CIFAR-5m we use standard data augmentation.
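The Appendix D hyperparameters quoted above map directly onto a standard PyTorch training loop. The sketch below is assumed, not taken from the authors' repository: it uses the stock torchvision ResNet-18 and CIFAR-10 loader for brevity (the paper likely uses a CIFAR-adapted ResNet-18), interprets "standard data augmentation" as random crop plus horizontal flip, and omits the 20% label-noise injection, the CIFAR-5m student data, and any weight decay, none of which are specified in this excerpt.

```python
import torch
import torchvision
from torchvision import transforms

# "Standard data augmentation" assumed to mean random crop + horizontal flip.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Learning rate starts at 0.1 and is divided by 10 at epochs 100 and 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```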