Feature Kernel Distillation
Authors: Bobby He, Mete Ozay
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we experimentally corroborate our theory in the image classification setting, showing that FKD is amenable to ensemble distillation, can transfer knowledge across datasets, and outperforms both vanilla KD & other feature kernel based KD baselines across a range of standard architectures & datasets. |
| Researcher Affiliation | Collaboration | Bobby He¹,² & Mete Ozay² (¹Department of Statistics, University of Oxford; ²Samsung Research UK) |
| Pseudocode | Yes | Pseudocode and PyTorch-style code for our FKD implementation are given in Algs. 1 and 2, respectively. (An illustrative feature-kernel loss sketch is given after this table.) |
| Open Source Code | No | The paper mentions using external open-source codebases (e.g., “Tian et al. (2020)'s excellent open-source PyTorch codebase”), but it does not provide its own source code for the methodology described. |
| Open Datasets | Yes | We first verify that larger ensemble teacher size, E, further improves FKD student performance as suggested by Theorem 2. This is confirmed in Fig. 4, using VGG8 for all student & teacher networks on the CIFAR-100 dataset. (...) From a fixed VGG13 teacher network trained on CIFAR-100, we distil to student VGG8 NNs on CIFAR-10, STL-10 & Tiny-ImageNet. |
| Dataset Splits | Yes | For FKD, RKD (Park et al., 2019) and SP (Tung & Mori, 2019), we tuned the learning rate, learning rate decay, and KD regularisation strength λKD on a labeled validation set of size 5000 for CIFAR-10 and 1000 for STL-10, before retraining using the best hyperparameters on the full training (+unlabeled) dataset. (A sketch of this tuning protocol follows the table.) |
| Hardware Specification | No | For STL-10, we used a batch size of 512 for all KD methods' regularisation terms, compared to 64 for the standard cross-entropy loss. This was because STL-10 has only 5K labeled datapoints, and we wanted to ensure that the student used as much of the unlabeled data as possible for each feature-kernel based KD method's additional regularisation term during 160 epochs of training. A batch size of 512 was the maximum power of 2 before we ran into memory issues on an 11GB VRAM GPU, which occurred for the RKD method. No specific GPU model or processor type is mentioned. (A two-loader sketch of this setup follows the table.) |
| Software Dependencies | No | The paper mentions using the “PyTorch codebase” and the “SpeechBrain library” but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | 160 epochs of training with batch size 128 and learning rate 0.1, which is decayed by a factor of 10 after epochs 80 and 120; SGD optimiser with momentum 0.9 and weight decay of 0.0001. (A minimal training-loop sketch follows the table.) |
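
For orientation, the following is a minimal sketch of a feature-kernel matching regulariser in the spirit of the Pseudocode row above. The cosine-normalised Gram-matrix kernel, the squared-distance penalty, and the `lambda_kd` weighting are illustrative assumptions, not the authors' exact Algs. 1 and 2.

```python
import torch
import torch.nn.functional as F


def feature_kernel(feats: torch.Tensor) -> torch.Tensor:
    """Batch feature kernel (Gram matrix) of L2-normalised penultimate features.

    feats: (batch, dim) tensor; returns a (batch, batch) kernel matrix.
    """
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()


def fkd_regulariser(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Squared distance between student and teacher feature kernels (illustrative)."""
    k_student = feature_kernel(student_feats)
    k_teacher = feature_kernel(teacher_feats.detach())  # teacher is frozen
    return (k_student - k_teacher).pow(2).mean()


def total_loss(logits, labels, student_feats, teacher_feats, lambda_kd=1.0):
    """Cross-entropy plus the kernel-matching term; lambda_kd is the tuned KD strength."""
    return F.cross_entropy(logits, labels) + lambda_kd * fkd_regulariser(student_feats, teacher_feats)
```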
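
The tuning protocol quoted in the Dataset Splits row can be written as a short hold-out routine. The grid values and the `train_and_evaluate` helper below are hypothetical; only the hold-out sizes (5,000 for CIFAR-10, 1,000 for STL-10) and the tuned quantities come from the paper.

```python
import itertools

import torch
from torch.utils.data import random_split


def tune_then_retrain(full_train_set, val_size, train_and_evaluate):
    """Hold out `val_size` labelled examples, grid-search, and return the best config.

    `train_and_evaluate(train_set, val_set, lr, lr_decay, lambda_kd)` is a hypothetical
    helper that trains a student with those hyperparameters and returns validation accuracy.
    """
    train_subset, val_subset = random_split(
        full_train_set,
        [len(full_train_set) - val_size, val_size],
        generator=torch.Generator().manual_seed(0),
    )
    grid = itertools.product(
        [0.1, 0.05, 0.01],   # learning rate (illustrative values)
        [0.1, 0.2],          # learning-rate decay factor (illustrative values)
        [0.1, 1.0, 10.0],    # lambda_KD regularisation strength (illustrative values)
    )
    best_cfg = max(grid, key=lambda cfg: train_and_evaluate(train_subset, val_subset, *cfg))
    # The paper then retrains on the full training (+unlabeled) data with the best config.
    return best_cfg
```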
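
The two batch sizes quoted in the Hardware Specification row (64 for the cross-entropy term, 512 for the KD regularisation term on STL-10) suggest a two-loader training loop. The sketch below is an assumption about how such a loop could be wired up; only the batch sizes and the labelled/unlabeled STL-10 splits are from the paper.

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import STL10

transform = T.Compose([T.ToTensor()])  # augmentation omitted for brevity
stl10_labeled = STL10(root="data", split="train", download=True, transform=transform)
stl10_unlabeled = STL10(root="data", split="unlabeled", download=True, transform=transform)

# 64-image labelled batches feed the standard cross-entropy loss; 512-image batches
# from the much larger unlabeled split feed the feature-kernel regularisation term.
labeled_loader = DataLoader(stl10_labeled, batch_size=64, shuffle=True, num_workers=4)
kd_loader = DataLoader(stl10_unlabeled, batch_size=512, shuffle=True, num_workers=4)

for (x_labeled, y_labeled), (x_kd, _) in zip(labeled_loader, kd_loader):
    ...  # cross-entropy on (x_labeled, y_labeled); KD regulariser on x_kd
```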
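
Finally, the optimisation schedule in the Experiment Setup row maps directly onto a standard PyTorch SGD + MultiStepLR loop. The toy model and random data below are placeholders so the snippet runs standalone; the optimiser, schedule, epoch count and batch size match the quoted values, and the FKD regulariser from the first sketch would be added to the loss in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder student and data; in the paper these are VGG-style networks on CIFAR images.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))
train_loader = DataLoader(
    TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 100, (512,))),
    batch_size=128, shuffle=True,
)

# Quoted setup: SGD, momentum 0.9, weight decay 1e-4, lr 0.1 decayed by 10x at epochs 80 and 120.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)  # plus the FKD regulariser in practice
        loss.backward()
        optimizer.step()
    scheduler.step()
```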