Zero-shot Knowledge Transfer via Adversarial Belief Matching

Authors: Paul Micaelli, Amos J. Storkey

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show its effectiveness on two common datasets, and 3) we define a measure of belief match between two networks in the vicinity of one's decision boundaries, and demonstrate that our zero-shot student closely matches its teacher. Our distillation results are shown in Figure 2 for a WRN-40-2 teacher and WRN-16-1 student, when using L_S as defined in Equation 1. Table 1: Zero-shot performance on various WRN teacher and student pairs for CIFAR-10.
Researcher Affiliation | Academia | Paul Micaelli, University of Edinburgh, {paul.micaelli}@ed.ac.uk; Amos Storkey, University of Edinburgh, {a.storkey}@ed.ac.uk
Pseudocode | Yes | Algorithm 1: Zero-shot KT (Section 3.1) and Algorithm 2: Compute transition curves of networks A and B, when stepping across decision boundaries of network A (Section 4.4). A hedged sketch of the Algorithm 1 loop is given below the table.
Open Source Code | Yes | Code is available at: https://github.com/polo5/ZeroShotKnowledgeTransfer
Open Datasets | Yes | We focus our experiments on two common datasets, SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | No | The paper discusses the absence of validation data for its zero-shot method and uses 'M images per class' for few-shot finetuning and baseline comparisons, but does not provide explicit train/validation/test splits with percentages or counts for its experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'Adam (Kingma and Ba, 2015)' for optimization but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | For each iteration we set n_G = 1 and n_S = 10. We use a generic generator with only three convolutional layers, and our input noise z has 100 dimensions. We use Adam (Kingma and Ba, 2015) with cosine annealing, with an initial learning rate of 2 × 10^-3. We set β = 250 unless otherwise stated. For our baselines, we choose the same settings used to train the teacher and student in the literature, namely SGD with momentum 0.9 and weight decay 5 × 10^-4. We scale the number of epochs such that the number of iterations is the same for all M. The initial learning rate is set to 0.1 and is divided by 5 at 30%, 60%, and 80% of the run. A hedged sketch of these optimizer settings follows the algorithm sketch below.
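
The following is a minimal PyTorch-style sketch of the zero-shot KT loop that Algorithm 1 describes, pieced together from the settings quoted above (n_G = 1, n_S = 10, 100-dimensional noise). The module names, optimizer objects, and helper function are hypothetical, and the β-weighted attention-transfer terms of the student loss are omitted for brevity; this is an illustration of the adversarial KL objective, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def kl_teacher_student(x, teacher, student):
    """KL(teacher || student) averaged over a batch of pseudo-samples x."""
    with torch.no_grad():
        t_prob = F.softmax(teacher(x), dim=1)
    s_logprob = F.log_softmax(student(x), dim=1)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")


def zero_shot_iteration(generator, teacher, student, opt_g, opt_s,
                        batch_size=128, z_dim=100, n_g=1, n_s=10, device="cpu"):
    """One outer iteration of zero-shot KT: adversarial generator vs. student."""
    z = torch.randn(batch_size, z_dim, device=device)

    # n_g generator steps: move pseudo-samples to where teacher and student disagree.
    for _ in range(n_g):
        x = generator(z)
        loss_g = -kl_teacher_student(x, teacher, student)  # maximise the KL
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

    # n_s student steps on the same noise batch: match the teacher's beliefs.
    for _ in range(n_s):
        with torch.no_grad():
            x = generator(z)  # generator held fixed during student updates
        loss_s = kl_teacher_student(x, teacher, student)
        # The paper's full student loss also adds beta-weighted attention-transfer
        # terms (beta = 250); they are omitted in this sketch.
        opt_s.zero_grad()
        loss_s.backward()
        opt_s.step()
```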
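
Similarly, a small sketch of the optimizer and schedule settings quoted in the Experiment Setup row. The generator and student modules and the iteration/epoch budgets are caller-supplied assumptions, and the function name is hypothetical.

```python
import torch


def build_optimizers(generator, student, total_steps, total_epochs):
    """Optimizer/schedule settings quoted above; all arguments are caller-supplied."""
    # Zero-shot transfer: Adam with cosine annealing, initial learning rate 2e-3.
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-3)
    opt_s = torch.optim.Adam(student.parameters(), lr=2e-3)
    sched_g = torch.optim.lr_scheduler.CosineAnnealingLR(opt_g, T_max=total_steps)
    sched_s = torch.optim.lr_scheduler.CosineAnnealingLR(opt_s, T_max=total_steps)

    # Baselines: SGD with momentum 0.9 and weight decay 5e-4, learning rate 0.1
    # divided by 5 at 30%, 60% and 80% of the run.
    opt_base = torch.optim.SGD(student.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=5e-4)
    milestones = [int(f * total_epochs) for f in (0.3, 0.6, 0.8)]
    sched_base = torch.optim.lr_scheduler.MultiStepLR(opt_base, milestones=milestones,
                                                      gamma=0.2)  # i.e. divide by 5
    return opt_g, opt_s, sched_g, sched_s, opt_base, sched_base
```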