StrassenNets: Deep Learning with a Multiplication Budget
Authors: Michael Tschannen, Aran Khanna, Animashree Anandkumar
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on CIFAR-10 and ImageNet show that our method applied to ResNet (He et al., 2016a) yields the same or higher accuracy than existing complexity reduction methods while using considerably fewer multiplications. For example, for ResNet-18 our method reduces the number of multiplications by 99.63% while incurring a top-1 accuracy degradation of only 2.0% compared to the full-precision model on ImageNet. |
| Researcher Affiliation | Collaboration | (1) ETH Zürich, Zürich, Switzerland (most of this work was done while MT was at Amazon AI); (2) Amazon AI, Palo Alto, CA, USA; (3) Caltech, Pasadena, CA, USA. |
| Pseudocode | Yes | see Fig. 1, right, and pseudocode in Appendix C. |
| Open Source Code | Yes | Code available at https://github.com/mitscha/strassennets. |
| Open Datasets | Yes | We apply our method to all convolution layers... of the ResNet architecture (He et al., 2016a) to create the so-called Strassen-ResNet (ST-ResNet). We evaluate ST-ResNet on CIFAR-10 (10 classes, 50k training images, 10k testing images) (Krizhevsky & Hinton, 2009) and ImageNet (ILSVRC2012; 1k classes, 1.2M training images, 50k testing images) (Russakovsky et al., 2015) for different choices of r, p, g, and compare the accuracy of ST-ResNet to related works. All models were trained from scratch... We apply our method to the character-level language model described in (Kim et al., 2016a) and evaluate it on the English Penn Treebank (PTB with word vocabulary size 10k, character vocabulary size 51, 1M training tokens, standard train-validation-test split, see (Kim et al., 2016a)) (Marcus et al., 1993). |
| Dataset Splits | Yes | The validation accuracy is computed from center crops. ... We apply our method to the character-level language model described in (Kim et al., 2016a) and evaluate it on the English Penn Treebank (PTB with word vocabulary size 10k, character vocabulary size 51, 1M training tokens, standard train-validation-test split, see (Kim et al., 2016a)) (Marcus et al., 1993). |
| Hardware Specification | No | The paper mentions support by 'AWS Cloud Credits for Research program' and discusses future work related to 'FPGAs and ASICs' as target platforms, but it does not specify the exact hardware (e.g., GPU/CPU models, memory) used for conducting the experiments described in the paper. |
| Software Dependencies | No | The paper mentions 'SGD' as an optimizer but does not specify any software platforms (e.g., TensorFlow, PyTorch) or libraries with their version numbers that were used to implement and run the experiments. |
| Experiment Setup | Yes | We generate a training set containing 100k pairs (Ai, Bi) with entries i.i.d. uniform on [−1, 1], train the SPN with full-precision weights (initialized i.i.d. uniform on [−1, 1]) for one epoch with SGD (learning rate 0.1, momentum 0.9, mini-batch size 4), activate quantization, and train for another epoch (with learning rate 0.001). ... We train for 250 epochs with initial learning rate 0.1 and mini-batch size 128, multiplying the learning rate by 0.1 after 150 and 200 epochs. ... We use an initial learning rate of 0.05 and mini-batch size 256, with two different learning rate schedules... All models are trained for 40 epochs using SGD with mini-batch size 20 and initial learning rate 2... |
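
The multiplication savings quoted in the table come from expressing (approximate) matrix products as two-layer sum-product networks (SPNs) with ternary transition matrices, the construction referenced above via Fig. 1 and the pseudocode in Appendix C. The snippet below is a minimal NumPy illustration of that SPN form, instantiated with the classic Strassen matrices for the 2×2 case (r = 7 multiplications instead of the naive 8). The row-major vectorization and the names `Wa`, `Wb`, `Wc` are choices made here for illustration, not taken from the authors' code.

```python
import numpy as np

# Ternary matrices (entries in {-1, 0, 1}) encoding Strassen's 2x2 algorithm
# as a two-layer SPN: vec(C) = Wc @ ((Wb @ vec(B)) * (Wa @ vec(A))).
# The hidden width r = 7 is the number of (real-valued) multiplications.
Wa = np.array([[ 1, 0, 0, 1],    # a11 + a22
               [ 0, 0, 1, 1],    # a21 + a22
               [ 1, 0, 0, 0],    # a11
               [ 0, 0, 0, 1],    # a22
               [ 1, 1, 0, 0],    # a11 + a12
               [-1, 0, 1, 0],    # a21 - a11
               [ 0, 1, 0, -1]])  # a12 - a22
Wb = np.array([[ 1, 0, 0, 1],    # b11 + b22
               [ 1, 0, 0, 0],    # b11
               [ 0, 1, 0, -1],   # b12 - b22
               [-1, 0, 1, 0],    # b21 - b11
               [ 0, 0, 0, 1],    # b22
               [ 1, 1, 0, 0],    # b11 + b12
               [ 0, 0, 1, 1]])   # b21 + b22
Wc = np.array([[1,  0, 0, 1, -1, 0, 1],   # c11
               [0,  0, 1, 0,  1, 0, 0],   # c12
               [0,  1, 0, 1,  0, 0, 0],   # c21
               [1, -1, 1, 0,  0, 1, 0]])  # c22

A = np.random.uniform(-1, 1, (2, 2))
B = np.random.uniform(-1, 1, (2, 2))
a, b = A.reshape(-1), B.reshape(-1)   # row-major vectorization (an assumption on ordering)
c = Wc @ ((Wb @ b) * (Wa @ a))        # only 7 real multiplications in the elementwise product
assert np.allclose(c.reshape(2, 2), A @ B)
```

In the synthetic experiment quoted in the Experiment Setup row, the ternary matrices are not hard-coded as above but learned from the 100k random (Ai, Bi) pairs, first with full-precision weights and then with quantization activated.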
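
Since the paper does not name its training framework (see the Software Dependencies row), the following PyTorch sketch is only an assumed rendering of the quoted CIFAR-10 schedule: 250 epochs, mini-batch size 128, initial learning rate 0.1, decayed by 0.1 after epochs 150 and 200. The placeholder model and the momentum value are assumptions for illustration, not taken from the paper.

```python
import torch

# Hypothetical stand-in model; the actual ST-ResNet definition is in the
# authors' repository (https://github.com/mitscha/strassennets).
model = torch.nn.Linear(3 * 32 * 32, 10)

# Quoted recipe: SGD, mini-batch size 128, initial learning rate 0.1,
# multiplied by 0.1 after epochs 150 and 200, for 250 epochs total.
# Momentum 0.9 is an assumption; it is only quoted for the synthetic experiment.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 200], gamma=0.1)

for epoch in range(250):
    # ... one pass over the CIFAR-10 training set with batch size 128 ...
    scheduler.step()
```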