Knowledge Distillation as Semiparametric Inference

Authors: Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements. |
| Researcher Affiliation | Collaboration | Tri Dao¹, Govinda M. Kamath², Vasilis Syrgkanis², Lester Mackey²; ¹ Department of Computer Science, Stanford University; ² Microsoft Research, New England |
| Pseudocode | No | The paper describes a multi-step procedure for 'cross-fitting' (Partition the dataset into B equally sized folds... For each fold t... Estimate f̂ by minimizing the empirical loss...), but this is presented as descriptive text rather than a formally labeled 'Algorithm' or 'Pseudocode' block (see the cross-fitting sketch below the table). |
| Open Source Code | Yes | Code to replicate all experiments can be found at https://github.com/microsoft/semiparametric-distillation |
| Open Datasets | Yes | On five real tabular datasets, cross-fitting and loss correction improve student performance by up to 4% AUC over vanilla KD. Furthermore, on CIFAR-10 (Krizhevsky & Hinton, 2009), a benchmark image classification dataset... FICO (FIC), Stumble Upon (Eve; Liu et al., 2017), and Adult, Higgs, and MAGIC from Dheeru & Karra Taniskidou (2017). |
| Dataset Splits | Yes | We use cross-fitting with 10 folds. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'SGD' for training and 'random forest' models, but it does not specify any software frameworks (e.g., PyTorch, TensorFlow) or their specific version numbers required for reproducibility. |
| Experiment Setup | Yes | We use SGD with initial learning rate 0.1, momentum 0.9, and batch size 128 to train for 200 epochs. We use the standard learning rate decay schedule, where the learning rate is divided by 5 at epochs 60, 120, and 160. ... The student is trained using the SEL loss with clipped teacher class probabilities max(p̂(x), ε) for ε = 10⁻³. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. We repeat the experiments 5 times to measure the mean and standard deviation. (See the training-configuration sketch below the table.) |