Feature Kernel Distillation

Authors: Bobby He, Mete Ozay

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we experimentally corroborate our theory in the image classification setting, showing that FKD is amenable to ensemble distillation, can transfer knowledge across datasets, and outperforms both vanilla KD & other feature kernel based KD baselines across a range of standard architectures & datasets.
Researcher Affiliation | Collaboration | Bobby He (1,2) & Mete Ozay (2); 1: Department of Statistics, University of Oxford; 2: Samsung Research UK
Pseudocode | Yes | Pseudocode and PyTorch-style code for our FKD implementation are given in Algs. 1 and 2 respectively. (A hedged kernel-matching sketch is given below the table.)
Open Source Code | No | The paper mentions using external open-source codebases (e.g., “Tian et al. (2020)'s excellent open-source PyTorch codebase”), but it does not provide its own source code for the methodology described.
Open Datasets | Yes | We first verify that larger ensemble teacher size, E, further improves FKD student performance as suggested by Theorem 2. This is confirmed in Fig. 4, using VGG8 for all student & teacher networks on the CIFAR-100 dataset. (...) From a fixed VGG13 teacher network trained on CIFAR-100, we distil to student VGG8 NNs on CIFAR-10, STL-10 & Tiny-ImageNet.
Dataset Splits | Yes | For FKD, RKD (Park et al., 2019) and SP (Tung & Mori, 2019), we tuned the learning rate, learning rate decay, and KD regularisation strength λKD on a labeled validation set of size 5000 for CIFAR-10 and 1000 for STL-10, before retraining using the best hyperparameters on the full training(+unlabeled) dataset. (A data-split sketch is given below the table.)
Hardware Specification | No | For STL-10, we used a batch size of 512 for all KD methods' regularisation terms, compared to 64 for the standard cross-entropy loss. This was due to the fact that STL-10 has only 5K labeled datapoints, and we wanted to ensure that the student used as much of the unlabeled data as possible for each feature-kernel-based KD method's additional regularisation term during 160 epochs of training. A batch size of 512 was the maximum power of 2 before we ran into memory issues on an 11 GB VRAM GPU, which occurred for the RKD method. No specific GPU model or processor type is mentioned.
Software Dependencies | No | The paper mentions using the “PyTorch codebase” and the “SpeechBrain library” but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | 160 epochs training time with batch size 128 and learning rate 0.1, which is decayed by a factor of 10 after epochs 80 and 120. SGD optimiser with momentum 0.9 and weight decay of 0.0001. (The optimiser configuration is sketched below.)
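
The Pseudocode row notes that the paper provides pseudocode and PyTorch-style code for FKD in its Algs. 1 and 2; those algorithms are not reproduced on this page. The snippet below is only a minimal sketch of a generic feature-kernel matching regulariser in that spirit, assuming the kernel is the Gram matrix of L2-normalised penultimate-layer features over a batch and that student and teacher kernels are matched with a squared Frobenius penalty. The function names, normalisation, and distance are our assumptions, not the authors' Alg. 2.

```python
# Minimal sketch of a feature-kernel matching regulariser in the spirit of FKD.
# Assumptions (ours, not the paper's Alg. 2): the kernel is the Gram matrix of
# L2-normalised penultimate features over a batch, and student/teacher kernels
# are matched with a squared Frobenius penalty.
import torch
import torch.nn.functional as F


def feature_kernel(features: torch.Tensor) -> torch.Tensor:
    """B x B feature kernel: inner products of L2-normalised flattened features."""
    f = F.normalize(features.flatten(start_dim=1), dim=1)
    return f @ f.t()


def fkd_regulariser(student_feats: torch.Tensor,
                    teacher_feats: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius distance between student and teacher feature kernels."""
    k_student = feature_kernel(student_feats)
    k_teacher = feature_kernel(teacher_feats).detach()  # teacher is frozen
    return (k_student - k_teacher).pow(2).mean()


# Combined student loss, with lambda_kd the regularisation strength tuned on the
# validation split described in the Dataset Splits row:
# loss = F.cross_entropy(student_logits, labels) \
#        + lambda_kd * fkd_regulariser(student_feats, teacher_feats)
```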
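The Dataset Splits row describes holding out 5,000 labeled CIFAR-10 examples (1,000 for STL-10) to tune the learning rate, decay schedule, and λKD before retraining on the full training(+unlabeled) set. Below is one way such a split could be created with torchvision; the transform, the fixed seed, and the use of random_split are illustrative assumptions, as the quoted text does not specify the exact split procedure.

```python
# Sketch of a held-out labeled validation split for CIFAR-10 (5,000 examples;
# the paper uses 1,000 for STL-10). The transform and seed are placeholders.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()
full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transform)

val_size = 5_000                       # 1_000 for STL-10
train_size = len(full_train) - val_size
train_set, val_set = random_split(
    full_train, [train_size, val_size],
    generator=torch.Generator().manual_seed(0),  # fixed seed for a stable split
)
```

For STL-10, the Hardware Specification row additionally quotes a batch size of 512 for the KD regularisation terms versus 64 for the cross-entropy loss, which would correspond to two dataloaders with different batch sizes over the same data.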
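The Experiment Setup row quotes the training hyperparameters directly; the sketch below translates them into a PyTorch optimiser and scheduler. The helper name and the use of MultiStepLR are our choices, and the paper may implement the decay differently.

```python
# Sketch of the quoted training configuration: 160 epochs, batch size 128,
# SGD with lr 0.1, momentum 0.9, weight decay 1e-4, and a 10x learning-rate
# decay after epochs 80 and 120.
import torch


def build_optimizer_and_scheduler(model: torch.nn.Module):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=1e-4,
    )
    # MultiStepLR multiplies the learning rate by 0.1 at the given milestones.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[80, 120], gamma=0.1
    )
    return optimizer, scheduler


EPOCHS = 160
BATCH_SIZE = 128
```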