Zero-shot Knowledge Transfer via Adversarial Belief Matching
Authors: Paul Micaelli, Amos J. Storkey
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show its effectiveness on two common datasets, and 3) we define a measure of belief match between two networks in the vicinity of one's decision boundaries, and demonstrate that our zero-shot student closely matches its teacher. Our distillation results are shown in Figure 2 for a WRN-40-2 teacher and WRN-16-1 student, when using L_S as defined in Equation 1. Table 1: Zero-shot performance on various WRN teacher and student pairs for CIFAR-10. |
| Researcher Affiliation | Academia | Paul Micaelli University of Edinburgh {paul.micaelli}@ed.ac.uk Amos Storkey University of Edinburgh {a.storkey}@ed.ac.uk |
| Pseudocode | Yes | Algorithm 1: Zero-shot KT (Section 3.1) and Algorithm 2: Compute transition curves of networks A and B, when stepping across decision boundaries of network A (Section 4.4). A hedged sketch of the Algorithm 1 loop is given after this table. |
| Open Source Code | Yes | Code is available at: https://github.com/polo5/ZeroShotKnowledgeTransfer |
| Open Datasets | Yes | We focus our experiments on two common datasets, SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky, 2009). |
| Dataset Splits | No | The paper discusses the absence of validation data for their zero-shot method and uses 'M images per class' for few-shot finetuning and baseline comparisons, but does not provide explicit train/validation/test splits with percentages or counts for their experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma and Ba, 2015)' for optimization but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | For each iteration we set n_G = 1 and n_S = 10. We use a generic generator with only three convolutional layers, and our input noise z has 100 dimensions. We use Adam (Kingma and Ba, 2015) with cosine annealing, with an initial learning rate of 2 × 10⁻³. We set β = 250 unless otherwise stated. For our baselines, we choose the same settings used to train the teacher and student in the literature, namely SGD with momentum 0.9 and weight decay 5 × 10⁻⁴. We scale the number of epochs such that the number of iterations is the same for all M. The initial learning rate is set to 0.1 and is divided by 5 at 30%, 60%, and 80% of the run. (A sketch of these optimizer settings also follows the table.) |
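
For orientation, below is a minimal sketch of the zero-shot KT loop named in the Pseudocode row (Algorithm 1), assuming a PyTorch-style setup. The module names, the KL helper, the batch size, and the device are illustrative assumptions, and the β-weighted attention term of Equation 1 is omitted; only the loop structure (n_G generator steps maximizing the KL between teacher and student, then n_S student steps minimizing it) follows the paper's description. It is a sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student), averaged over the pseudo-batch."""
    p_t = F.softmax(teacher_logits, dim=1)
    log_p_t = F.log_softmax(teacher_logits, dim=1)
    log_p_s = F.log_softmax(student_logits, dim=1)
    return (p_t * (log_p_t - log_p_s)).sum(dim=1).mean()

def zero_shot_iteration(generator, student, teacher, opt_g, opt_s,
                        z_dim=100, batch_size=128, n_g=1, n_s=10, device="cpu"):
    """One outer iteration of the zero-shot KT loop (hypothetical helper)."""
    # Generator step(s): synthesize pseudo-samples on which teacher and student disagree.
    for _ in range(n_g):
        z = torch.randn(batch_size, z_dim, device=device)
        x = generator(z)
        loss_g = -kl_teacher_student(teacher(x), student(x))  # maximize divergence
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # Student step(s): match the teacher's beliefs on freshly generated samples.
    for _ in range(n_s):
        z = torch.randn(batch_size, z_dim, device=device)
        x = generator(z).detach()
        loss_s = kl_teacher_student(teacher(x), student(x))   # minimize divergence
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```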
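The optimization settings quoted in the Experiment Setup row can be sketched as below. The placeholder modules and the run lengths (`total_iters`, `epochs`) are assumptions introduced only to make the snippet self-contained; the hyperparameter values themselves (Adam at 2 × 10⁻³ with cosine annealing; SGD with momentum 0.9, weight decay 5 × 10⁻⁴, learning rate 0.1 divided by 5 at 30%, 60%, and 80% of the run) are the ones reported in the paper.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the 3-conv generator and WRN student (assumptions).
generator = nn.Linear(100, 32 * 32 * 3)
student = nn.Linear(32 * 32 * 3, 10)
total_iters, epochs = 80_000, 200  # illustrative run lengths, not taken from the paper

# Zero-shot training: Adam with cosine annealing, initial learning rate 2e-3.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-3)
opt_s = torch.optim.Adam(student.parameters(), lr=2e-3)
sched_g = torch.optim.lr_scheduler.CosineAnnealingLR(opt_g, T_max=total_iters)
sched_s = torch.optim.lr_scheduler.CosineAnnealingLR(opt_s, T_max=total_iters)

# Baselines / few-shot finetuning: SGD with momentum 0.9 and weight decay 5e-4,
# learning rate 0.1 divided by 5 at 30%, 60%, and 80% of the run.
opt_base = torch.optim.SGD(student.parameters(), lr=0.1,
                           momentum=0.9, weight_decay=5e-4)
milestones = [int(0.3 * epochs), int(0.6 * epochs), int(0.8 * epochs)]
sched_base = torch.optim.lr_scheduler.MultiStepLR(opt_base, milestones=milestones, gamma=0.2)
```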