Knowledge Distillation as Semiparametric Inference

Authors: Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements. |
| Researcher Affiliation | Collaboration | Tri Dao¹, Govinda M. Kamath², Vasilis Syrgkanis², Lester Mackey²; ¹ Department of Computer Science, Stanford University; ² Microsoft Research, New England |
| Pseudocode | No | The paper describes a multi-step procedure for 'cross-fitting' (Partition the dataset into B equally sized folds... For each fold t... Estimate f̂ by minimizing the empirical loss...), but this is presented as descriptive text rather than a formally labeled 'Algorithm' or 'Pseudocode' block (see the cross-fitting sketch below the table). |
| Open Source Code | Yes | Code to replicate all experiments can be found at https://github.com/microsoft/semiparametric-distillation |
| Open Datasets | Yes | On five real tabular datasets, cross-fitting and loss correction improve student performance by up to 4% AUC over vanilla KD. Furthermore, on CIFAR-10 (Krizhevsky & Hinton, 2009), a benchmark image classification dataset... FICO (FIC), Stumble Upon (Eve; Liu et al., 2017), and Adult, Higgs, and MAGIC from Dheeru & Karra Taniskidou (2017). |
| Dataset Splits | Yes | We use cross-fitting with 10 folds. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'SGD' for training and 'random forest' models, but it does not specify any software frameworks (e.g., PyTorch, TensorFlow) or their specific version numbers required for reproducibility. |
| Experiment Setup | Yes | We use SGD with initial learning rate 0.1, momentum 0.9, and batch size 128 to train for 200 epochs. We use the standard learning rate decay schedule, where the learning rate is divided by 5 at epochs 60, 120, and 160. ... The student is trained using the SEL loss with clipped teacher class probabilities max(p̂(x), ε) for ε = 10⁻³. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. We repeat the experiments 5 times to measure the mean and standard deviation. (See the training-configuration sketch below the table.) |