Rigging the Lottery: Making All Tickets Winners
Authors: Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. |
| Researcher Affiliation | Industry | 1Google Brain, 2DeepMind. Correspondence to: Utku Evci <evcu@google.com>, Erich Elsen <eriche@google.com>. |
| Pseudocode | Yes | Algorithm 1 RigL |
| Open Source Code | Yes | Code available at github.com/google-research/rigl |
| Open Datasets | Yes | Our experiments include image classification using CNNs on the ImageNet-2012 (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009) datasets and character based language modeling using RNNs with the WikiText-103 dataset (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions training on ImageNet-2012, CIFAR-10, and WikiText-103 datasets, and refers to 'validation loss' for the language modeling task. However, it does not explicitly provide specific training/validation/test split percentages, sample counts, or direct references to how the data was partitioned for each dataset (e.g., '80/10/10 split' or 'standard splits from X'). |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as particular GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions a 'Tensorflow implementation' but does not specify the version number of TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | For all dynamic sparse training methods (SET, SNFS, RigL), we use the same update schedule with ∆T = 100 and α = 0.3 unless stated otherwise. ... We set T_end to 25k for ImageNet-2012 and 75k for CIFAR-10 training... For ImageNet-2012: We set the momentum coefficient of the optimizer to 0.9, L2 regularization coefficient to 0.0001, and label smoothing to 0.1. The learning rate schedule starts with a linear warm up reaching its maximum value of 1.6 at epoch 5 which is then dropped by a factor of 10 at epochs 30, 70 and 90. We train our networks with a batch size of 4096 for 32000 steps... For CIFAR-10: The learning rate starts at 0.1 which is scaled down by a factor of 5 every 30,000 iterations. We use an L2 regularization coefficient of 5e-4, a batch size of 128 and a momentum coefficient of 0.9. |
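The "Pseudocode" row above points to Algorithm 1 (RigL). The following is a minimal NumPy sketch of that drop-and-grow mask update, assuming the cosine decay f(t) = α/2 · (1 + cos(πt/T_end)) and the ∆T, α, and T_end values quoted in the "Experiment Setup" row; function and variable names are illustrative and are not taken from the released code at github.com/google-research/rigl.

```python
# Hedged sketch of one RigL mask update (drop lowest-|w| active weights,
# grow the same number of inactive weights with the largest |gradient|).
# Intended to be called every ∆T = 100 optimizer steps, per the paper.
import numpy as np

def cosine_drop_fraction(step, alpha=0.3, t_end=25_000):
    """Fraction of active weights to update, annealed with the paper's cosine schedule."""
    if step > t_end:
        return 0.0
    return (alpha / 2.0) * (1.0 + np.cos(np.pi * step / t_end))

def rigl_update(weights, mask, dense_grad, step, alpha=0.3, t_end=25_000):
    """Return updated (weights, mask) with the same number of active connections."""
    n_active = int(mask.sum())
    n_update = int(cosine_drop_fraction(step, alpha, t_end) * n_active)
    if n_update == 0:
        return weights, mask

    # Drop: among active connections, remove those with the smallest |w|.
    active_scores = np.where(mask > 0, np.abs(weights), np.inf)
    drop_idx = np.argsort(active_scores, axis=None)[:n_update]

    # Grow: among inactive connections, enable those with the largest |grad|.
    grow_scores = np.where(mask > 0, -np.inf, np.abs(dense_grad))
    grow_idx = np.argsort(grow_scores, axis=None)[::-1][:n_update]

    new_mask = mask.flatten()
    new_weights = weights.flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    new_weights[grow_idx] = 0.0      # newly grown connections start from zero
    new_weights *= new_mask          # dropped connections no longer contribute
    return new_weights.reshape(weights.shape), new_mask.reshape(mask.shape)
```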
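The ImageNet-2012 learning-rate schedule quoted in the "Experiment Setup" row can be written as a small helper. This is only a sketch of the described piecewise schedule (linear warmup to 1.6 by epoch 5, then a factor-of-10 drop at epochs 30, 70 and 90); the function name and epoch-based interface are assumptions for illustration, not code from the paper's repository.

```python
# Sketch of the reported ImageNet-2012 learning-rate schedule.
def imagenet_learning_rate(epoch: float, peak_lr: float = 1.6, warmup_epochs: int = 5) -> float:
    """Linear warmup to peak_lr by epoch 5, then a 10x drop at epochs 30, 70 and 90."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    num_drops = sum(epoch >= boundary for boundary in (30, 70, 90))
    return peak_lr * (0.1 ** num_drops)
```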