Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
Authors: Zeyuan Allen-Zhu, Yuanzhi Li
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We defer discussions of our empirical results to Section 5. However, we highlight some of the empirical findings, as they confirm and justify our theoretical approach to studying ensemble and knowledge distillation in deep learning. Specifically, we give empirical evidence showing that knowledge distillation does not work for random feature mappings, and that ensemble in deep learning is very different from ensemble in random feature mappings. (Figure 1 reports single-model, ensemble, knowledge-distillation, and self-distillation accuracies for WRN-28-10 on CIFAR-10 and CIFAR-100; see the illustrative distillation sketch after this table.) |
| Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Meta FAIR Labs, zeyuanallenzhu@meta.com); Yuanzhi Li (Mohamed bin Zayed University of AI, Yuanzhi.Li@mbzuai.ac.ae) |
| Pseudocode | No | The paper describes algorithms and updates using mathematical notation and prose, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology or a link to a code repository. It mentions that the full version of the paper is on arXiv. |
| Open Datasets | Yes | The empirical results use the publicly available CIFAR-10 and CIFAR-100 datasets (WRN-28-10 experiments reported in Figure 1). |
| Dataset Splits | No | The paper mentions training data (Z) and discusses test accuracy, but it does not specify the train/validation/test splits (e.g., percentages or sample counts) for the datasets used in its empirical results. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries with their versions). |
| Experiment Setup | No | The paper describes theoretical parameters like learning rate η and number of iterations T in general terms for its proofs, but it does not provide specific numerical hyperparameters (e.g., learning rate = 0.01, batch size = 64) or detailed training configurations for the empirical experiments presented (e.g., in Figure 1). |
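
For context on the knowledge-distillation and self-distillation results cited in the table above, the sketch below shows a standard distillation objective (KL divergence to temperature-softened teacher outputs combined with hard-label cross-entropy). The paper releases no code or hyperparameters, so this is only an illustrative sketch: the temperature `T` and mixing weight `alpha` are assumptions, not values from the paper.

```python
# Illustrative only: a standard knowledge-distillation loss.
# T (temperature) and alpha (soft/hard mixing weight) are assumed values;
# the paper does not specify its training configuration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Combine KL divergence to the teacher's softened outputs with the
    usual cross-entropy on the true labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)          # teacher probabilities
    log_student = F.log_softmax(student_logits / T, dim=1)       # student log-probabilities
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Self-distillation corresponds to the special case where the teacher is a previously trained model of the same architecture as the student.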