Zero-shot Knowledge Transfer via Adversarial Belief Matching
Authors: Paul Micaelli, Amos J. Storkey
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show its effectiveness on two common datasets, and 3) we define a measure of belief match between two networks in the vicinity of one's decision boundaries, and demonstrate that our zero-shot student closely matches its teacher. Our distillation results are shown in Figure 2 for a WRN-40-2 teacher and WRN-16-1 student, when using L_S as defined in Equation 1. Table 1: Zero-shot performance on various WRN teacher and student pairs for CIFAR-10. |
| Researcher Affiliation | Academia | Paul Micaelli University of Edinburgh {paul.micaelli}@ed.ac.uk Amos Storkey University of Edinburgh {a.storkey}@ed.ac.uk |
| Pseudocode | Yes | Algorithm 1: Zero-shot KT (Section 3.1) and Algorithm 2: Compute transition curves of networks A and B, when stepping across decision boundaries of network A (Section 4.4). A hedged sketch of the Algorithm 1 loop is given after this table. |
| Open Source Code | Yes | Code is available at: https://github.com/polo5/ZeroShotKnowledgeTransfer |
| Open Datasets | Yes | We focus our experiments on two common datasets, SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky, 2009). |
| Dataset Splits | No | The paper discusses the absence of validation data for their zero-shot method and uses 'M images per class' for few-shot finetuning and baseline comparisons, but does not provide explicit train/validation/test splits with percentages or counts for their experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma and Ba, 2015)' for optimization but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | For each iteration we set n_G = 1 and n_S = 10. We use a generic generator with only three convolutional layers, and our input noise z has 100 dimensions. We use Adam (Kingma and Ba, 2015) with cosine annealing, with an initial learning rate of 2 × 10⁻³. We set β = 250 unless otherwise stated. For our baselines, we choose the same settings used to train the teacher and student in the literature, namely SGD with momentum 0.9 and weight decay 5 × 10⁻⁴. We scale the number of epochs such that the number of iterations is the same for all M. The initial learning rate is set to 0.1 and is divided by 5 at 30%, 60%, and 80% of the run. (A sketch of these optimizer settings also follows the table.) |
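
For orientation, below is a minimal sketch of the zero-shot KT loop named in the Pseudocode row (Algorithm 1), assuming a PyTorch-style setup. The module names, the KL helper, the batch size, and the device are illustrative assumptions, and the β-weighted attention term of Equation 1 is omitted; only the loop structure (n_G generator steps maximizing the KL between teacher and student, then n_S student steps minimizing it) follows the paper's description. It is a sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student), averaged over the pseudo-batch."""
    p_t = F.softmax(teacher_logits, dim=1)
    log_p_t = F.log_softmax(teacher_logits, dim=1)
    log_p_s = F.log_softmax(student_logits, dim=1)
    return (p_t * (log_p_t - log_p_s)).sum(dim=1).mean()

def zero_shot_iteration(generator, student, teacher, opt_g, opt_s,
                        z_dim=100, batch_size=128, n_g=1, n_s=10, device="cpu"):
    """One outer iteration of the zero-shot KT loop (hypothetical helper)."""
    # Generator step(s): synthesize pseudo-samples on which teacher and student disagree.
    for _ in range(n_g):
        z = torch.randn(batch_size, z_dim, device=device)
        x = generator(z)
        loss_g = -kl_teacher_student(teacher(x), student(x))  # maximize divergence
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # Student step(s): match the teacher's beliefs on freshly generated samples.
    for _ in range(n_s):
        z = torch.randn(batch_size, z_dim, device=device)
        x = generator(z).detach()
        loss_s = kl_teacher_student(teacher(x), student(x))   # minimize divergence
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```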
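The optimization settings quoted in the Experiment Setup row can be sketched as below. The placeholder modules and the run lengths (`total_iters`, `epochs`) are assumptions introduced only to make the snippet self-contained; the hyperparameter values themselves (Adam at 2 × 10⁻³ with cosine annealing; SGD with momentum 0.9, weight decay 5 × 10⁻⁴, learning rate 0.1 divided by 5 at 30%, 60%, and 80% of the run) are the ones reported in the paper.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the 3-conv generator and WRN student (assumptions).
generator = nn.Linear(100, 32 * 32 * 3)
student = nn.Linear(32 * 32 * 3, 10)
total_iters, epochs = 80_000, 200  # illustrative run lengths, not taken from the paper

# Zero-shot training: Adam with cosine annealing, initial learning rate 2e-3.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-3)
opt_s = torch.optim.Adam(student.parameters(), lr=2e-3)
sched_g = torch.optim.lr_scheduler.CosineAnnealingLR(opt_g, T_max=total_iters)
sched_s = torch.optim.lr_scheduler.CosineAnnealingLR(opt_s, T_max=total_iters)

# Baselines / few-shot finetuning: SGD with momentum 0.9 and weight decay 5e-4,
# learning rate 0.1 divided by 5 at 30%, 60%, and 80% of the run.
opt_base = torch.optim.SGD(student.parameters(), lr=0.1,
                           momentum=0.9, weight_decay=5e-4)
milestones = [int(0.3 * epochs), int(0.6 * epochs), int(0.8 * epochs)]
sched_base = torch.optim.lr_scheduler.MultiStepLR(opt_base, milestones=milestones, gamma=0.2)
```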