Knowledge Distillation: Bad Models Can Be Good Role Models
Authors: Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): To this point, we saw that getting a sampler (and thus a teacher) from a noisy distribution can be more sample efficient than getting a learner. Furthermore, we showed that we can leverage multiple independent teachers to approximate the Bayes optimal classifier either via ensembling at inference time or via distillation on unlabeled data. We now complement our theoretical results with an experimental evaluation, showing the benefit of using distillation when training on noisy data. While in our theoretical setting we studied teachers that are trained on entirely disjoint training sets, in practice we find it more effective to train the teachers on overlapping datasets, as well as training on the same dataset with different random initializations. To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). We see that our teachers achieve 81.3% test accuracy (see Table 1) and behave closely to samplers (see Figure 1), reproducing the results of [20]. We now compare the three methods considered before for using teachers to get learners: 1) test-time ensembling; 2) ensemble as distillation teacher; and 3) random-teacher distillation. For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles the CIFAR-10 dataset [21], where the labels are provided by the previously trained teachers. We report our results in Table 1, where the reported accuracies are on the CIFAR-10 test data. Observe that using an ensemble for inference reduces the noise significantly and achieves test accuracy of 87.8% (versus 81.3% for a single teacher). When applying distillation, both random pseudo-labeling and ensemble pseudo-labeling further increase the test accuracy to about 90%. In addition, we study how the number of teachers affects performance (see Figure 2). We observe that both random pseudo-labeling and ensemble majority improve in performance as the number of teachers grows. |
| Researcher Affiliation | Collaboration | Gal Kaplun (Harvard University & Mobileye) galkaplun@g.harvard.edu; Eran Malach (Hebrew University & Mobileye) eran.malach@mail.huji.ac.il; Preetum Nakkiran (University of California San Diego) preetum@ucsd.edu; Shai Shalev-Shwartz (Hebrew University & Mobileye) shais@cs.huji.ac.il |
| Pseudocode | No | The paper describes algorithms such as Ensemble-Pseudo-Labeling (EPL) and Random-Pseudo-Labeling (RPL) in narrative text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps. (A hedged code sketch of these two labeling rules is given after this table.) |
| Open Source Code | Yes | See repository https://github.com/GalKaplun/sampler-distillation |
| Open Datasets | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) For distillation, we train a student network on CIFAR-5m, a large (5-million-example) dataset that resembles the CIFAR-10 dataset [21], where the labels are provided by the previously trained teachers. |
| Dataset Splits | No | The paper states training on CIFAR-10 for teachers and CIFAR-5m for the student, and reporting results on CIFAR-10 test data, but it does not explicitly specify the training, validation, and test dataset splits with percentages or sample counts for the experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using ResNet-18 but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | To get the teachers, we train a ResNet-18 [13] on CIFAR-10 with 20% fixed and non-uniform label noise (see full details in Appendix D). (...) Appendix D: Training Details and Architectures. For all our experiments, we use a ResNet-18 [13] as our base architecture and train it for 200 epochs using SGD with momentum (0.9) and a learning rate starting at 0.1 and decreased by a factor of 10 at epochs 100 and 150. We use a batch size of 128. For both CIFAR-10 and CIFAR-5m we use standard data augmentation. (A hedged sketch of this recipe follows the table.) |
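
The EPL/RPL rules referenced in the Pseudocode row are only described in prose in the paper. As a reading aid, below is a minimal PyTorch sketch of the two labeling rules. The function names, the hard-label and majority-vote choices, and the assumption that each teacher is a callable module returning class logits are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch of the two pseudo-labeling rules described in the paper's prose
# (not the authors' code). Assumptions: each teacher returns class logits for a
# batch of unlabeled images (e.g. CIFAR-5m), and hard labels are produced for the student.
import torch


@torch.no_grad()
def ensemble_pseudo_labels(teachers, x):
    """EPL-style labels: majority vote over the teachers' hard predictions."""
    preds = torch.stack([t(x).argmax(dim=-1) for t in teachers])  # (num_teachers, batch)
    return preds.mode(dim=0).values                               # (batch,)


@torch.no_grad()
def random_pseudo_labels(teachers, x):
    """RPL-style labels: each example is labeled by one teacher drawn uniformly at random."""
    preds = torch.stack([t(x).argmax(dim=-1) for t in teachers])  # (num_teachers, batch)
    choice = torch.randint(len(teachers), (x.shape[0],))          # one teacher index per example
    return preds[choice, torch.arange(x.shape[0])]
```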
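
Likewise, the Appendix D recipe quoted in the Experiment Setup row can be restated as a short training-loop sketch. The torchvision `resnet18` (used here as a stand-in for the CIFAR-adapted ResNet-18), the specific augmentation transforms, and the data wiring are assumptions for illustration only, and the 20% label-noise injection used for teacher training is omitted; the paper's exact setup is in the linked repository.

```python
# Hedged sketch of the reported recipe: ResNet-18, SGD (momentum 0.9), LR 0.1 decayed
# 10x at epochs 100 and 150, batch size 128, 200 epochs, standard CIFAR augmentation.
# Dataset/model wiring below is illustrative, not the authors' exact implementation.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # "standard data augmentation" (assumed choice)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = models.resnet18(num_classes=10)      # stand-in for the CIFAR-adapted ResNet-18
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 150], gamma=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    sched.step()
```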