Knowledge Distillation Performs Partial Variance Reduction

Authors: Mher Safaryan, Alexandra Peste, Dan Alistarh

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
Researcher Affiliation | Academia | Mher Safaryan, IST Austria (mher.safaryan@ista.ac.at); Alexandra Peste, IST Austria (alexandra.peste@ista.ac.at); Dan Alistarh, IST Austria (dan.alistarh@ista.ac.at)
Pseudocode | Yes | Algorithm 1: Knowledge Distillation via SGD (a hedged training-loop sketch is given after the table)
Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | Specifically, we consider classification problems using linear models in two different setups: training a linear model on the MNIST dataset [24] and linear probing on the CIFAR-10 dataset [23], using a ResNet50 model [12], pre-trained on the ImageNet dataset [42]. (The linear-probing setup is sketched after the table.)
Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 datasets and evaluating performance, but it does not explicitly provide specific percentages, sample counts, or methodologies for how data was split into training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using SGD, but does not specify any software libraries, frameworks, or their version numbers (e.g., PyTorch, TensorFlow, Python version) that would be needed to replicate the experiments.
Experiment Setup | Yes | In both cases we train using SGD without momentum and regularization, with a fixed learning rate and mini-batch of size 10, for a total of 100 epochs. (These settings appear in the training-loop sketch below.)
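
The Pseudocode and Experiment Setup rows describe Algorithm 1 (Knowledge Distillation via SGD) only through its hyperparameters. The following is a minimal PyTorch-style sketch of such a loop, assuming a standard soft-label distillation loss weighted by lam (the weighting the paper argues must be chosen carefully); the framework, the exact loss form, the temperature, and the learning-rate value are assumptions, since the paper does not report them.

```python
# Hedged sketch of "Knowledge Distillation via SGD" (Algorithm 1) for a linear
# student model. The paper only states: SGD without momentum or regularization,
# a fixed (unreported) learning rate, mini-batches of size 10, and 100 epochs.
import torch
import torch.nn.functional as F

def kd_sgd(student, teacher, loader, lam=0.5, temperature=1.0,
           lr=0.01, epochs=100):
    """Train `student` on a convex combination of the hard-label loss and the
    distillation loss, weighted by `lam`. `lr` is a placeholder value."""
    opt = torch.optim.SGD(student.parameters(), lr=lr,
                          momentum=0.0, weight_decay=0.0)  # no momentum, no regularization
    teacher.eval()
    for _ in range(epochs):                       # 100 epochs in the paper's setup
        for x, y in loader:                       # mini-batches of size 10
            with torch.no_grad():
                t_logits = teacher(x)             # teacher predictions (soft labels)
            s_logits = student(x)
            ce = F.cross_entropy(s_logits, y)     # hard-label loss
            kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                          F.softmax(t_logits / temperature, dim=1),
                          reduction="batchmean") * temperature ** 2  # distillation loss
            loss = (1.0 - lam) * ce + lam * kd
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```
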
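
The CIFAR-10 experiment in the Open Datasets row is a linear probe: a fixed, ImageNet-pretrained ResNet50 supplies features and only a linear classifier is trained on top of them. Below is a minimal sketch of that setup; the torchvision API and the learning-rate value are assumptions, as the paper names no software stack.

```python
# Hedged sketch of the linear-probing setup on CIFAR-10 with a frozen,
# ImageNet-pretrained ResNet50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # pre-trained on ImageNet
backbone.fc = nn.Identity()                  # expose the 2048-d feature vector
for p in backbone.parameters():
    p.requires_grad = False                  # linear probing: backbone stays frozen
backbone.eval()

head = nn.Linear(2048, 10)                   # linear model over the 10 CIFAR-10 classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.01,   # lr value is a placeholder
                            momentum=0.0, weight_decay=0.0)
```

Freezing the backbone reduces the problem to training a linear model on fixed 2048-dimensional features, which is what lets the paper treat this setup alongside its linear-model analysis.
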