Knowledge Distillation Performs Partial Variance Reduction

Authors: Mher Safaryan, Alexandra Peste, Dan Alistarh

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
Researcher Affiliation | Academia | Mher Safaryan, IST Austria (mher.safaryan@ista.ac.at); Alexandra Peste, IST Austria (alexandra.peste@ista.ac.at); Dan Alistarh, IST Austria (dan.alistarh@ista.ac.at)
Pseudocode | Yes | Algorithm 1: Knowledge Distillation via SGD (a hedged training-loop sketch is given after the table)
Open Source Code | No | The paper does not provide any explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | Specifically, we consider classification problems using linear models in two different setups: training a linear model on the MNIST dataset [24] and linear probing on the CIFAR-10 dataset [23], using a ResNet50 model [12], pre-trained on the ImageNet dataset [42]. (The linear-probing setup is sketched after the table.)
Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 datasets and evaluating performance, but it does not explicitly provide specific percentages, sample counts, or methodologies for how data was split into training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using SGD, but does not specify any software libraries, frameworks, or their version numbers (e.g., PyTorch, TensorFlow, Python version) that would be needed to replicate the experiments.
Experiment Setup | Yes | In both cases we train using SGD without momentum and regularization, with a fixed learning rate and mini-batch of size 10, for a total of 100 epochs. (These settings appear in the training-loop sketch below.)
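
The Pseudocode and Experiment Setup rows describe Algorithm 1 (Knowledge Distillation via SGD) only through its hyperparameters. The following is a minimal PyTorch-style sketch of such a loop, assuming a standard soft-label distillation loss weighted by lam (the weighting the paper argues must be chosen carefully); the framework, the exact loss form, the temperature, and the learning-rate value are assumptions, since the paper does not report them.

```python
# Hedged sketch of "Knowledge Distillation via SGD" (Algorithm 1) for a linear
# student model. The paper only states: SGD without momentum or regularization,
# a fixed (unreported) learning rate, mini-batches of size 10, and 100 epochs.
import torch
import torch.nn.functional as F

def kd_sgd(student, teacher, loader, lam=0.5, temperature=1.0,
           lr=0.01, epochs=100):
    """Train `student` on a convex combination of the hard-label loss and the
    distillation loss, weighted by `lam`. `lr` is a placeholder value."""
    opt = torch.optim.SGD(student.parameters(), lr=lr,
                          momentum=0.0, weight_decay=0.0)  # no momentum, no regularization
    teacher.eval()
    for _ in range(epochs):                       # 100 epochs in the paper's setup
        for x, y in loader:                       # mini-batches of size 10
            with torch.no_grad():
                t_logits = teacher(x)             # teacher predictions (soft labels)
            s_logits = student(x)
            ce = F.cross_entropy(s_logits, y)     # hard-label loss
            kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                          F.softmax(t_logits / temperature, dim=1),
                          reduction="batchmean") * temperature ** 2  # distillation loss
            loss = (1.0 - lam) * ce + lam * kd
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```
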
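
The CIFAR-10 experiment in the Open Datasets row is a linear probe: a fixed, ImageNet-pretrained ResNet50 supplies features and only a linear classifier is trained on top of them. Below is a minimal sketch of that setup; the torchvision API and the learning-rate value are assumptions, as the paper names no software stack.

```python
# Hedged sketch of the linear-probing setup on CIFAR-10 with a frozen,
# ImageNet-pretrained ResNet50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # pre-trained on ImageNet
backbone.fc = nn.Identity()                  # expose the 2048-d feature vector
for p in backbone.parameters():
    p.requires_grad = False                  # linear probing: backbone stays frozen
backbone.eval()

head = nn.Linear(2048, 10)                   # linear model over the 10 CIFAR-10 classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.01,   # lr value is a placeholder
                            momentum=0.0, weight_decay=0.0)
```

Freezing the backbone reduces the problem to training a linear model on fixed 2048-dimensional features, which is what lets the paper treat this setup alongside its linear-model analysis.
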