Moonshine: Distilling with Cheap Convolutions

Authors: Elliot J. Crowley, Gavin Gray, Amos J. Storkey

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data. In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016). (See the attention-transfer sketch after the table.)
Researcher Affiliation | Academia | Elliot J. Crowley, School of Informatics, University of Edinburgh, elliot.j.crowley@ed.ac.uk; Gavin Gray, School of Informatics, University of Edinburgh, g.d.b.gray@ed.ac.uk; Amos Storkey, School of Informatics, University of Edinburgh, a.storkey@ed.ac.uk
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce these experiments is available at https://github.com/BayesWatch/pytorch-moonshine.
Open Datasets | Yes | In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016).
Dataset Splits | No | The paper uses standard benchmark datasets (CIFAR-10, CIFAR-100, ImageNet, Cityscapes) and refers to validation sets, but does not explicitly give split percentages, sample counts, or a splitting methodology; it relies on the implied standard splits of these datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'pytorch-moonshine' in its GitHub link, implying the use of PyTorch, but does not provide specific version numbers for any software dependencies such as Python or PyTorch.
Experiment Setup | Yes | For training we used minibatches of size 128... Networks were trained for 200 epochs using SGD with momentum fixed at 0.9 with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.2 at the start of epochs 60, 120, and 160. For knowledge distillation we set α to 0.9 and used a temperature of 4. For attention transfer β was set to 1000. Models were trained for 100 epochs using SGD with an initial learning rate of 0.1, momentum of 0.9 and weight decay of 10^-4. The learning rate was reduced by a factor of 10 every 30 epochs. Minibatches of size 256 were used across 4 GPUs. For encoder training, the outputs of layers 7, 12, and 16 were used for attention transfer with β = 1000. For decoder training, the outputs of layers 19 and 22 were also used and β was dropped to 600. (See the training-schedule sketch after the table.)
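
The Research Type and Experiment Setup rows name attention transfer (β = 1000) and knowledge distillation (α = 0.9, temperature 4) as the training objectives. The following is a minimal sketch of what those two losses typically look like in PyTorch; the function names, the exact attention-map normalisation, and the reduction choices are assumptions rather than the authors' implementation, which is available at https://github.com/BayesWatch/pytorch-moonshine.

```python
# Hedged sketch of the distillation losses described in the paper:
# attention transfer (Zagoruyko & Komodakis) and knowledge distillation
# (Hinton et al.). Names and normalisation details are assumptions.
import torch
import torch.nn.functional as F

def attention_map(activations):
    # Collapse an N x C x H x W activation tensor to a normalised
    # N x (H*W) spatial attention map by averaging squared channels.
    a = activations.pow(2).mean(dim=1).flatten(1)
    return F.normalize(a, p=2, dim=1)

def attention_transfer_loss(student_acts, teacher_acts, beta=1000.0):
    # Mean squared distance between normalised attention maps,
    # summed over the chosen layer pairs and scaled by beta.
    loss = 0.0
    for s, t in zip(student_acts, teacher_acts):
        loss = loss + (attention_map(s) - attention_map(t)).pow(2).mean()
    return beta * loss

def kd_loss(student_logits, teacher_logits, targets, alpha=0.9, T=4.0):
    # Soft-target KL term (scaled by T^2, as is standard) blended with
    # ordinary cross-entropy on the hard labels.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```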
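
The CIFAR schedule quoted in the Experiment Setup row (SGD, momentum 0.9, initial learning rate 0.1, 200 epochs, rate cut by a factor of 0.2 at epochs 60, 120 and 160, minibatches of 128) maps onto a standard optimizer/scheduler pairing. The sketch below assumes that mapping; the model, data loader, criterion and weight decay value are placeholders to be checked against the repository.

```python
# Hedged sketch of the quoted CIFAR training schedule. In the paper's
# distillation setting the criterion would also include the attention
# transfer and knowledge distillation terms sketched above.
import torch

def train_cifar(student, train_loader, criterion, epochs=200, weight_decay=5e-4):
    # NOTE: the weight decay for the CIFAR runs is not quoted in this row;
    # 5e-4 is a common default and is an assumption here.
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=weight_decay)
    # Reduce the learning rate by a factor of 0.2 at epochs 60, 120, 160.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 160], gamma=0.2)
    for epoch in range(epochs):
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(student(inputs), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```

The ImageNet schedule in the same row (100 epochs, rate divided by 10 every 30 epochs, minibatches of 256 across 4 GPUs) would use the same pattern with milestones=[30, 60, 90] and gamma=0.1.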