Rigging the Lottery: Making All Tickets Winners

Authors: Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on ImageNet-2012, and RNNs on WikiText-103.
Researcher Affiliation | Industry | Google Brain and DeepMind. Correspondence to: Utku Evci <evcu@google.com>, Erich Elsen <eriche@google.com>.
Pseudocode | Yes | Algorithm 1 RigL (a minimal sketch of this drop-and-grow step appears after the table).
Open Source Code | Yes | Code available at github.com/google-research/rigl
Open Datasets | Yes | Our experiments include image classification using CNNs on the ImageNet-2012 (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009) datasets and character-based language modeling using RNNs with the WikiText-103 dataset (Merity et al., 2016).
Dataset Splits | No | The paper mentions training on ImageNet-2012, CIFAR-10, and WikiText-103, and refers to 'validation loss' for the language modeling task. However, it does not explicitly provide training/validation/test split percentages, sample counts, or references to how each dataset was partitioned (e.g., '80/10/10 split' or 'standard splits from X').
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as particular GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions a 'Tensorflow implementation' but does not specify the version of TensorFlow or any other software dependencies.
Experiment Setup | Yes | For all dynamic sparse training methods (SET, SNFS, RigL), we use the same update schedule with ΔT = 100 and α = 0.3 unless stated otherwise. ... We set T_end to 25k for ImageNet-2012 and 75k for CIFAR-10 training... For ImageNet-2012: We set the momentum coefficient of the optimizer to 0.9, the L2 regularization coefficient to 0.0001, and label smoothing to 0.1. The learning rate schedule starts with a linear warm-up reaching its maximum value of 1.6 at epoch 5, which is then dropped by a factor of 10 at epochs 30, 70, and 90. We train our networks with a batch size of 4096 for 32000 steps... For CIFAR-10: The learning rate starts at 0.1 and is scaled down by a factor of 5 every 30,000 iterations. We use an L2 regularization coefficient of 5e-4, a batch size of 128, and a momentum coefficient of 0.9. (The cosine-annealed update schedule is sketched after the table.)
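
For reference, below is a minimal NumPy sketch of the per-layer drop-and-grow step that Algorithm 1 (RigL) describes: drop the smallest-magnitude active weights, then regrow the same number of inactive connections with the largest dense-gradient magnitude. This is our illustrative reconstruction, not the authors' TensorFlow implementation; the function and variable names (rigl_mask_update, dense_grad, drop_fraction) are ours.

    import numpy as np

    def rigl_mask_update(weights, mask, dense_grad, drop_fraction):
        """One RigL mask update for a single layer (illustrative sketch).

        Drops the smallest-magnitude active weights and regrows the same
        number of currently inactive connections with the largest gradient
        magnitude, so the layer's sparsity level is preserved.
        """
        w = weights.ravel()
        m = mask.ravel().astype(bool)
        g = dense_grad.ravel()
        k = int(drop_fraction * m.sum())  # number of connections to update

        # Drop: among active connections, remove the k smallest |w|.
        drop_scores = np.where(m, np.abs(w), np.inf)
        drop_idx = np.argsort(drop_scores)[:k]
        new_mask = m.copy()
        new_mask[drop_idx] = False

        # Grow: among connections outside the *previous* mask, activate the k
        # with the largest |gradient| (so just-dropped weights are not regrown).
        grow_scores = np.where(m, -np.inf, np.abs(g))
        grow_idx = np.argsort(grow_scores)[::-1][:k]
        new_mask[grow_idx] = True

        # Newly activated connections are initialized to zero, as in the paper.
        new_w = np.where(new_mask, w, 0.0)
        new_w[grow_idx] = 0.0
        return new_w.reshape(weights.shape), new_mask.reshape(mask.shape)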
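The update schedule quoted in the Experiment Setup row (ΔT = 100, α = 0.3, T_end of 25k or 75k steps) can be sketched as follows. This is our reading of the paper's cosine-annealed fraction of updated connections; the helper names and defaults are illustrative, and the returned fraction would feed the drop_fraction argument of the sketch above.

    import math

    def cosine_drop_fraction(step, alpha=0.3, t_end=25_000):
        """Fraction of connections to update at a given step, decayed with a
        cosine schedule from alpha at step 0 to zero at t_end (defaults
        mirror the quoted ImageNet-2012 setting)."""
        if step >= t_end:
            return 0.0
        return 0.5 * alpha * (1.0 + math.cos(math.pi * step / t_end))

    def is_mask_update_step(step, delta_t=100, t_end=25_000):
        """Mask updates happen every delta_t steps until t_end."""
        return step < t_end and step % delta_t == 0

    # Example: at step 12,500 (halfway to t_end) the fraction has decayed from
    # 0.3 to 0.15; after 25k steps the connectivity is fixed for the rest of
    # training.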