Moonshine: Distilling with Cheap Convolutions
Authors: Elliot J. Crowley, Gavin Gray, Amos J. Storkey
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data. In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016). |
| Researcher Affiliation | Academia | Elliot J. Crowley School of Informatics University of Edinburgh elliot.j.crowley@ed.ac.uk Gavin Gray School of Informatics University of Edinburgh g.d.b.gray@ed.ac.uk Amos Storkey School of Informatics University of Edinburgh a.storkey@ed.ac.uk |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to reproduce these experiments is available at https://github.com/BayesWatch/pytorch-moonshine. |
| Open Datasets | Yes | In Section 4 we evaluate student networks with these blocks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Finally, in Section 5 we examine the efficacy of such networks for the tasks of ImageNet (Russakovsky et al., 2015) classification, and semantic segmentation on the Cityscapes dataset (Cordts et al., 2016). |
| Dataset Splits | No | The paper uses standard benchmark datasets like CIFAR-10, CIFAR-100, ImageNet, and Cityscapes and refers to validation sets, but does not explicitly provide specific percentages, sample counts, or detailed methodology for splitting the data into training, validation, and test sets. It relies on implied standard splits for these datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'pytorch-moonshine' in its GitHub link, implying the use of PyTorch, but does not provide specific version numbers for any software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | For training we used minibatches of size 128... Networks were trained for 200 epochs using SGD with momentum fixed at 0.9 with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.2 at the start of epochs 60, 120, and 160. For knowledge distillation we set α to 0.9 and used a temperature of 4. For attention transfer β was set to 1000. Models were trained for 100 epochs using SGD with an initial learning rate of 0.1, momentum of 0.9 and weight decay of 10⁻⁴. The learning rate was reduced by a factor of 10 every 30 epochs. Minibatches of size 256 were used across 4 GPUs. For encoder training, the outputs of layers 7, 12, and 16 were used for attention transfer with β = 1000. For decoder training, the outputs of layers 19 and 22 were also used and β was dropped to 600. |
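
The distillation strategy summarised in the Research Type row pairs a standard cross-entropy objective with attention transfer and, in separate runs, Hinton-style knowledge distillation, using the hyperparameters quoted in the Experiment Setup row (α = 0.9, temperature 4, β = 1000). Below is a minimal PyTorch sketch of those two losses under the standard formulations; the function names and the channel-mean attention map are editorial assumptions, not code taken from the authors' pytorch-moonshine repository.

```python
import torch
import torch.nn.functional as F

def attention_map(fmap):
    """Spatial attention map: channel-wise mean of squared activations,
    flattened and L2-normalised (standard attention-transfer formulation)."""
    a = fmap.pow(2).mean(dim=1)              # (N, C, H, W) -> (N, H, W)
    return F.normalize(a.flatten(1), dim=1)  # (N, H*W), unit L2 norm

def attention_transfer_loss(student_maps, teacher_maps, beta=1000.0):
    """Sum of squared distances between matched student/teacher attention maps,
    scaled by the quoted beta = 1000."""
    return beta * sum(
        (attention_map(s) - attention_map(t)).pow(2).mean()
        for s, t in zip(student_maps, teacher_maps)
    )

def kd_loss(student_logits, teacher_logits, targets, alpha=0.9, T=4.0):
    """Hinton-style knowledge distillation with the quoted alpha = 0.9, T = 4:
    softened KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```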
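The CIFAR schedule quoted in the Experiment Setup row (minibatches of 128, 200 epochs, SGD with momentum 0.9, initial learning rate 0.1 decayed by a factor of 0.2 at epochs 60, 120, and 160) could be wired up as follows. This is a hedged sketch only: the `train_student` function, the assumption that both models return `(logits, feature_maps)`, and the data-loading details are illustrative rather than lifted from the released code, and it reuses the loss helpers from the sketch above.

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader

def train_student(student, teacher, train_set, device="cuda"):
    """Train a student with cross-entropy plus attention transfer, following the
    quoted CIFAR schedule. Weight decay is not given for CIFAR in the quoted
    text, so it is omitted here."""
    loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
    optimizer = SGD(student.parameters(), lr=0.1, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)
    student.to(device).train()
    teacher.to(device).eval()

    for epoch in range(200):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            with torch.no_grad():
                t_logits, t_maps = teacher(images)   # assumed (logits, feature maps) interface
            s_logits, s_maps = student(images)
            # attention_transfer_loss is the helper defined in the previous sketch
            loss = F.cross_entropy(s_logits, targets) \
                 + attention_transfer_loss(s_maps, t_maps, beta=1000.0)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```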