Knowledge Distillation as Semiparametric Inference
Authors: Tri Dao, Govinda M. Kamath, Vasilis Syrgkanis, Lester Mackey
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements. |
| Researcher Affiliation | Collaboration | Tri Dao (1), Govinda M. Kamath (2), Vasilis Syrgkanis (2), Lester Mackey (2); (1) Department of Computer Science, Stanford University; (2) Microsoft Research, New England |
| Pseudocode | No | The paper describes a multi-step procedure for 'cross-fitting' (Partition the dataset into B equally sized folds... For each fold t... Estimate f̂ by minimizing the empirical loss...), but this is presented as descriptive text rather than a formally labeled 'Algorithm' or 'Pseudocode' block. (A minimal code sketch of this procedure is given below the table.) |
| Open Source Code | Yes | Code to replicate all experiments can be found at https://github.com/microsoft/semiparametric-distillation |
| Open Datasets | Yes | On five real tabular datasets, cross-fitting and loss correction improve student performance by up to 4% AUC over vanilla KD. Furthermore, on CIFAR-10 (Krizhevsky & Hinton, 2009), a benchmark image classification dataset... FICO (FIC), Stumble Upon (Eve; Liu et al., 2017), and Adult, Higgs, and MAGIC from Dheeru & Karra Taniskidou (2017). |
| Dataset Splits | Yes | We use cross-fitting with 10 folds. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'SGD' for training and 'random forest' models, but it does not specify any software frameworks (e.g., PyTorch, TensorFlow) or their specific version numbers required for reproducibility. |
| Experiment Setup | Yes | We use SGD with initial learning rate 0.1, momentum 0.9, and batch size 128 to train for 200 epochs. We use the standard learning rate decay schedule, where the learning rate is divided by 5 at epochs 60, 120, and 160. ... The student is trained using the SEL loss with clipped teacher class probabilities max(p̂(x), ε) for ε = 10⁻³. The α hyperparameter of the loss correction was chosen by cross-validation with 5 folds. We repeat the experiments 5 times to measure the mean and standard deviation. (A training-loop sketch of this setup follows the table.) |
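
For the cross-fitting procedure referenced in the Pseudocode and Dataset Splits rows, here is a minimal sketch assuming a scikit-learn random-forest teacher (random forests are the tabular teachers mentioned in the Software Dependencies row); the `train_student` call in the usage comment is a hypothetical placeholder, not a function from the paper's repository.

```python
# Minimal sketch of B-fold cross-fitting for distillation targets:
# the teacher is fit on B-1 folds and predicts only on the held-out fold,
# so the student never receives in-sample teacher predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def cross_fit_teacher_probs(X, y, n_folds=10, seed=0):
    """Return out-of-fold teacher class probabilities for every example."""
    p_hat = np.zeros((len(X), len(np.unique(y))))
    for train_idx, held_out_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        teacher = RandomForestClassifier(n_estimators=100, random_state=seed)
        teacher.fit(X[train_idx], y[train_idx])                       # fit on the other B-1 folds
        p_hat[held_out_idx] = teacher.predict_proba(X[held_out_idx])  # predict on the held-out fold
    return p_hat

# Usage (hypothetical student trainer):
# p_hat = cross_fit_teacher_probs(X, y, n_folds=10)
# student = train_student(X, y, soft_targets=p_hat)
```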
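
The Experiment Setup row can likewise be read as a concrete training loop. The sketch below mirrors only the stated hyperparameters (SGD with initial learning rate 0.1, momentum 0.9, batch size 128, 200 epochs, learning rate divided by 5 at epochs 60/120/160, teacher probabilities clipped at ε = 10⁻³); the tiny linear student, the random data, and the soft-target cross-entropy objective are illustrative stand-ins, not the paper's SEL loss, α-weighted loss correction, or network architectures.

```python
# PyTorch sketch of the stated CIFAR-10 training recipe.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

num_classes, eps = 10, 1e-3
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))  # stand-in student

# Illustrative data: images, labels, and (cross-fitted) teacher probabilities.
X = torch.randn(512, 3, 32, 32)
y = torch.randint(0, num_classes, (512,))
p_hat = torch.softmax(torch.randn(512, num_classes), dim=1)
loader = DataLoader(TensorDataset(X, y, p_hat), batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
# Divide the learning rate by 5 at epochs 60, 120, and 160.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)

for epoch in range(200):
    for x, yb, pb in loader:
        p_clipped = pb.clamp(min=eps)  # max(p_hat(x), eps) with eps = 1e-3
        # Stand-in distillation objective: cross-entropy against clipped soft targets
        # (the paper's SEL loss and alpha correction are not reproduced here).
        loss = -(p_clipped * torch.log_softmax(student(x), dim=1)).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```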