Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We defer discussion of our empirical results to Section 5, but we highlight some of the findings, as they confirm and justify our theoretical approach to studying ensemble and knowledge distillation in deep learning. Specifically, we give empirical evidence showing that knowledge distillation does not work for random feature mappings, and that ensemble in deep learning is very different from ensemble in random feature mappings (see Figure 1). The response also quotes Figure 1's accuracy numbers for WRN-28-10 on CIFAR-10 and CIFAR-100 (single-model, ensemble, knowledge-distillation, and self-distillation accuracies); a minimal distillation sketch appears after this table.
Researcher Affiliation | Collaboration | Zeyuan Allen-Zhu (Meta FAIR Labs, zeyuanallenzhu@meta.com); Yuanzhi Li (Mohamed bin Zayed University of AI, Yuanzhi.Li@mbzuai.ac.ae)
Pseudocode | No | The paper describes its algorithms and updates using mathematical notation and prose, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology or a link to a code repository. It mentions only that the full version of the paper is on arXiv.
Open Datasets | Yes | The experiments use WRN-28-10 trained on CIFAR-10 and on CIFAR-100, both publicly available benchmark datasets.
Dataset Splits | No | The paper mentions training data (Z) and discusses test accuracy, but it does not specify the train/validation/test splits (e.g., percentages or sample counts) for the datasets used in its empirical results.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models or cloud computing specifications.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific libraries with their versions).
Experiment Setup | No | The paper describes theoretical parameters such as the learning rate η and the number of iterations T in general terms for its proofs, but it does not provide specific numerical hyperparameters (e.g., learning rate = 0.01, batch size = 64) or detailed training configurations for the empirical experiments presented (e.g., in Figure 1); an illustrative configuration sketch follows the table.
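
To make concrete what the "knowledge distillation / self-distillation" accuracies in the Research Type row refer to, below is a minimal sketch of distilling an ensemble teacher into a single student. This is not the paper's code: the temperature, loss weighting, optimizer settings, and the placeholder models and data loader are illustrative assumptions.

```python
# Minimal sketch: distilling an ensemble "teacher" into a single "student".
# Models and data loader are placeholders; T (temperature), alpha, and the
# optimizer settings are illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def ensemble_logits(models, x):
    """Average the logits of independently trained models (the 'ensemble')."""
    with torch.no_grad():
        return torch.stack([m(x) for m in models], dim=0).mean(dim=0)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-label KL term (scaled by T^2) plus a hard-label cross-entropy term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def train_student(student, teachers, loader, epochs=1, lr=0.1):
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = distillation_loss(student(x), ensemble_logits(teachers, x), y)
            loss.backward()
            opt.step()
    return student

# Self-distillation is the special case where the "teacher" is a single model of
# the same architecture as the student: train_student(student, [teacher], loader).
```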
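For context on the Dataset Splits and Experiment Setup rows: the paper does not report splits or training hyperparameters, but CIFAR-10 and CIFAR-100 ship with a fixed 50,000-image training set and 10,000-image test set. The sketch below shows one plausible way such a setup could be specified with torchvision; every numeric value (normalization statistics, batch sizes, the commented optimizer recipe) is an assumption for illustration and does not come from the paper.

```python
# Illustrative only: standard CIFAR-10 split and a typical training configuration.
# The paper does not report these details; every value below is an assumption.
import torch
import torchvision
import torchvision.transforms as transforms

normalize = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
test_tf = transforms.Compose([transforms.ToTensor(), normalize])

# CIFAR-10 ships with a fixed split: 50,000 training images, 10,000 test images.
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4)

# A common (hypothetical) recipe for WRN-28-10 on CIFAR: SGD with momentum 0.9,
# weight decay 5e-4, initial learning rate 0.1 with step or cosine decay, ~200 epochs.
```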