Moonshine: Distilling with Cheap Convolutions

Authors: Elliot J. Crowley, Gavin Gray, Amos J. Storkey

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data. In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016). (See the attention-transfer sketch after the table.)
Researcher Affiliation | Academia | Elliot J. Crowley, School of Informatics, University of Edinburgh, elliot.j.crowley@ed.ac.uk; Gavin Gray, School of Informatics, University of Edinburgh, g.d.b.gray@ed.ac.uk; Amos Storkey, School of Informatics, University of Edinburgh, a.storkey@ed.ac.uk
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce these experiments is available at https://github.com/BayesWatch/pytorch-moonshine.
Open Datasets | Yes | In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016).
Dataset Splits | No | The paper uses standard benchmark datasets (CIFAR-10, CIFAR-100, ImageNet, Cityscapes) and refers to validation sets, but does not explicitly give split percentages, sample counts, or a splitting methodology; it relies on the implied standard splits of these datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'pytorch-moonshine' in its GitHub link, implying the use of PyTorch, but does not provide specific version numbers for any software dependencies such as Python or PyTorch.
Experiment Setup | Yes | For training we used minibatches of size 128... Networks were trained for 200 epochs using SGD with momentum fixed at 0.9 with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.2 at the start of epochs 60, 120, and 160. For knowledge distillation we set α to 0.9 and used a temperature of 4. For attention transfer β was set to 1000. Models were trained for 100 epochs using SGD with an initial learning rate of 0.1, momentum of 0.9 and weight decay of 10^-4. The learning rate was reduced by a factor of 10 every 30 epochs. Minibatches of size 256 were used across 4 GPUs. For encoder training, the outputs of layers 7, 12, and 16 were used for attention transfer with β = 1000. For decoder training, the outputs of layers 19 and 22 were also used and β was dropped to 600. (See the training-schedule sketch after the table.)
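
The Research Type and Experiment Setup rows name attention transfer (β = 1000) and knowledge distillation (α = 0.9, temperature 4) as the training objectives. The following is a minimal sketch of what those two losses typically look like in PyTorch; the function names, the exact attention-map normalisation, and the reduction choices are assumptions rather than the authors' implementation, which is available at https://github.com/BayesWatch/pytorch-moonshine.

```python
# Hedged sketch of the distillation losses described in the paper:
# attention transfer (Zagoruyko & Komodakis) and knowledge distillation
# (Hinton et al.). Names and normalisation details are assumptions.
import torch
import torch.nn.functional as F

def attention_map(activations):
    # Collapse an N x C x H x W activation tensor to a normalised
    # N x (H*W) spatial attention map by averaging squared channels.
    a = activations.pow(2).mean(dim=1).flatten(1)
    return F.normalize(a, p=2, dim=1)

def attention_transfer_loss(student_acts, teacher_acts, beta=1000.0):
    # Mean squared distance between normalised attention maps,
    # summed over the chosen layer pairs and scaled by beta.
    loss = 0.0
    for s, t in zip(student_acts, teacher_acts):
        loss = loss + (attention_map(s) - attention_map(t)).pow(2).mean()
    return beta * loss

def kd_loss(student_logits, teacher_logits, targets, alpha=0.9, T=4.0):
    # Soft-target KL term (scaled by T^2, as is standard) blended with
    # ordinary cross-entropy on the hard labels.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```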
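
The CIFAR schedule quoted in the Experiment Setup row (SGD, momentum 0.9, initial learning rate 0.1, 200 epochs, rate cut by a factor of 0.2 at epochs 60, 120 and 160, minibatches of 128) maps onto a standard optimizer/scheduler pairing. The sketch below assumes that mapping; the model, data loader, criterion and weight decay value are placeholders to be checked against the repository.

```python
# Hedged sketch of the quoted CIFAR training schedule. In the paper's
# distillation setting the criterion would also include the attention
# transfer and knowledge distillation terms sketched above.
import torch

def train_cifar(student, train_loader, criterion, epochs=200, weight_decay=5e-4):
    # NOTE: the weight decay for the CIFAR runs is not quoted in this row;
    # 5e-4 is a common default and is an assumption here.
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=weight_decay)
    # Reduce the learning rate by a factor of 0.2 at epochs 60, 120, 160.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 160], gamma=0.2)
    for epoch in range(epochs):
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(student(inputs), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```

The ImageNet schedule in the same row (100 epochs, rate divided by 10 every 30 epochs, minibatches of 256 across 4 GPUs) would use the same pattern with milestones=[30, 60, 90] and gamma=0.1.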