Knowledge Distillation: Bad Models Can Be Good Role Models

Authors: Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 Experiments) | To this point, we saw that getting a sampler (and thus a teacher) from a noisy distribution can be more sample-efficient than getting a learner. Furthermore, we showed that we can leverage multiple independent teachers to approximate the Bayes-optimal classifier, either via ensembling at inference time or via distillation on unlabeled data. We now complement our theoretical results with an experimental evaluation, showing the benefit of using distillation when training on noisy data. While in our theoretical setting we studied teachers trained on entirely disjoint training sets, in practice we find it more effective to train the teachers on overlapping datasets, as well as on the same dataset with different random initializations. To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). We see that our teachers achieve 81.3% test accuracy (see Table 1) and behave similarly to samplers (see Figure 1), reproducing the results of [20]. We now compare the three methods considered before for using teachers to get learners: 1) test-time ensembling; 2) ensemble as distillation teacher; and 3) random-teacher distillation. For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles CIFAR-10 [21], where the labels are provided by the previously trained teachers. We report our results in Table 1, where the reported accuracies are on the CIFAR-10 test data. Observe that using an ensemble for inference reduces the noise significantly, achieving a test accuracy of 87.8% (versus 81.3% for a single teacher). When applying distillation, both random pseudo-labeling and ensemble pseudo-labeling further increase the test accuracy to about 90%. In addition, we study how the number of teachers affects performance (see Figure 2). We observe that both random pseudo-labeling and ensemble majority voting improve as the number of teachers grows.
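As a concrete illustration of method 1 above (test-time ensembling), the snippet below averages the teachers' softmax outputs on each test batch and predicts the argmax class. This is a minimal sketch under stated assumptions, not the authors' released code: the function name, the probability-averaging rule (the paper may instead use a hard majority vote), and the PyTorch data-loading interface are all assumptions.

```python
import torch

@torch.no_grad()
def ensemble_accuracy(teachers, test_loader, device="cuda"):
    # Test-time ensembling sketch: average the teachers' predicted class
    # probabilities per batch and score the argmax against the true labels.
    for t in teachers:
        t.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        probs = torch.stack([t(x).softmax(dim=1) for t in teachers]).mean(dim=0)
        correct += (probs.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```

Under this reading, the 87.8% ensemble figure would correspond to calling such a routine with all trained teachers on the CIFAR-10 test loader, versus evaluating a single teacher in the same way.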
Researcher Affiliation | Collaboration | Gal Kaplun (Harvard University & Mobileye), galkaplun@g.harvard.edu; Eran Malach (Hebrew University & Mobileye), eran.malach@mail.huji.ac.il; Preetum Nakkiran (University of California San Diego), preetum@ucsd.edu; Shai Shalev-Shwartz (Hebrew University & Mobileye), shais@cs.huji.ac.il
Pseudocode | No | The paper describes algorithms such as Ensemble-Pseudo-Labeling (EPL) and Random-Pseudo-Labeling (RPL) in narrative text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps.
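Since EPL and RPL are described only in prose, here is one plausible reading of the two pseudo-labeling rules as code. The function names, the hard-majority-vote interpretation of EPL, and the batched teacher interface are assumptions based on the excerpt above, not the paper's formal definitions.

```python
import torch

@torch.no_grad()
def teacher_predictions(teachers, x):
    # Each teacher's hard class prediction for a batch x; shape (T, B).
    return torch.stack([t(x).argmax(dim=1) for t in teachers])

def ensemble_pseudo_labels(teachers, x):
    # EPL sketch: label each unlabeled example with the majority vote
    # ("ensemble majority") over the teachers' predictions.
    return teacher_predictions(teachers, x).mode(dim=0).values

def random_pseudo_labels(teachers, x):
    # RPL sketch: label each unlabeled example with the prediction of one
    # teacher chosen uniformly at random (all teachers are evaluated here,
    # and a single prediction is then selected per example).
    preds = teacher_predictions(teachers, x)                                # (T, B)
    idx = torch.randint(len(teachers), (preds.shape[1],), device=preds.device)
    return preds[idx, torch.arange(preds.shape[1], device=preds.device)]    # (B,)
```

In this sketch, the student would be trained on CIFAR-5m images paired with the outputs of ensemble_pseudo_labels or random_pseudo_labels in place of ground-truth labels.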
Open Source Code | Yes | See repository: https://github.com/GalKaplun/sampler-distillation
Open Datasets | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles CIFAR-10 [21], where the labels are provided by the previously trained teachers.
Dataset Splits | No | The paper states training on CIFAR-10 for teachers and CIFAR-5m for the student, and reporting results on CIFAR-10 test data, but it does not explicitly specify the training, validation, and test dataset splits with percentages or sample counts for the experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using ResNet-18 but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) Appendix D: Training Details and Architectures. For all our experiments, we use a ResNet-18 [13] as our base architecture and train it for 200 epochs using SGD with momentum (0.9) and a learning rate starting at 0.1 and decreased by a factor of 10 at epochs 100 and 150. We use a batch size of 128. For both CIFAR-10 and CIFAR-5m we use standard data augmentation.
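The Appendix D hyperparameters quoted above map directly onto a standard PyTorch training loop. The sketch below is assumed, not taken from the authors' repository: it uses the stock torchvision ResNet-18 and CIFAR-10 loader for brevity (the paper likely uses a CIFAR-adapted ResNet-18), interprets "standard data augmentation" as random crop plus horizontal flip, and omits the 20% label-noise injection, the CIFAR-5m student data, and any weight decay, none of which are specified in this excerpt.

```python
import torch
import torchvision
from torchvision import transforms

# "Standard data augmentation" assumed to mean random crop + horizontal flip.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Learning rate starts at 0.1 and is divided by 10 at epochs 100 and 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```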