The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Authors: Jonathan Frankle, Michael Carbin

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10."
Researcher Affiliation | Academia | Jonathan Frankle, MIT CSAIL, jfrankle@csail.mit.edu; Michael Carbin, MIT CSAIL, mcarbin@csail.mit.edu
Pseudocode | Yes | "Strategy 1: Iterative pruning with resetting. 1. Randomly initialize a neural network f(x; m ⊙ θ) where θ = θ_0 and m = 1^|θ| is a mask. 2. Train the network for j iterations, reaching parameters m ⊙ θ_j. 3. Prune s% of the parameters, creating an updated mask m′ where P_m′ = (P_m - s)%. 4. Reset the weights of the remaining portion of the network to their values in θ_0. That is, let θ = θ_0. 5. Let m = m′ and repeat steps 2 through 4 until a sufficiently pruned network has been obtained." (A hedged code sketch of this procedure appears after the table.)
Open Source Code | No | The paper does not state that the authors' source code is publicly available and does not link to a code repository for their methodology.
Open Datasets | Yes | "We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10."
Dataset Splits | Yes | "We randomly sampled a 5,000-example validation set from the training set and used the remaining 55,000 training examples as our training set for the rest of the paper (including Section 2)."
Hardware Specification | No | "We gratefully acknowledge IBM, which through the MIT-IBM Watson AI Lab contributed the computational resources necessary to conduct the experiments in this paper." (This acknowledgement does not specify exact hardware models or configurations.)
Software Dependencies | No | The paper mentions software components and optimizers (e.g., the Adam optimizer, SGD, dropout, batch normalization) but does not give specific version numbers for any of them (e.g., 'PyTorch 1.9', 'TensorFlow 2.x').
Experiment Setup | Yes | "The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire training set is shuffled." and "We use the Adam optimizer (Kingma & Ba, 2014) and Gaussian Glorot initialization (Glorot & Bengio, 2010)." and "We use a batch size of 128. We use batch normalization. We use weight decay of 0.0001." (A hedged configuration sketch combining these quotes with the dataset split above appears after the table.)
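
The pseudocode row above can be illustrated concretely. Below is a minimal PyTorch sketch of iterative magnitude pruning with resetting. The layer sizes, the 20% per-round pruning rate, the five pruning rounds, the synthetic training data, and the choice to leave biases unpruned are illustrative assumptions, not the authors' released implementation (the paper does not link any code).

```python
import copy
import torch
import torch.nn as nn

def make_model():
    # Small fully-connected network; layer sizes here are an assumption.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 300), nn.ReLU(),
        nn.Linear(300, 100), nn.ReLU(),
        nn.Linear(100, 10),
    )

def apply_mask(model, masks):
    # Zero out pruned weights so only the masked subnetwork f(x; m ⊙ θ) is active.
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.mul_(masks[name])

def prune_by_magnitude(model, masks, s):
    # Step 3: prune the fraction s of surviving weights with the smallest magnitude.
    new_masks = {}
    for name, param in model.named_parameters():
        mask = masks[name]
        if "weight" not in name:                      # leave biases unpruned (assumption)
            new_masks[name] = mask
            continue
        surviving = param.detach().abs()[mask.bool()]
        k = int(s * surviving.numel())
        if k == 0:
            new_masks[name] = mask
            continue
        threshold = surviving.sort().values[k]
        new_masks[name] = mask * (param.detach().abs() >= threshold).float()
    return new_masks

def train(model, masks, steps=100):
    # Step 2: train for j iterations. Synthetic data is used only so the sketch
    # runs end to end; a real reproduction would iterate over MNIST mini-batches.
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x = torch.randn(60, 1, 28, 28)                # dummy mini-batch of 60 examples
        y = torch.randint(0, 10, (60,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        apply_mask(model, masks)                      # keep pruned weights at zero

# Step 1: randomly initialize f(x; m ⊙ θ) with θ = θ_0 and m = 1^|θ|.
model = make_model()
theta_0 = copy.deepcopy(model.state_dict())
masks = {name: torch.ones_like(p) for name, p in model.named_parameters()}

for pruning_round in range(5):                        # until sufficiently pruned
    train(model, masks)                               # Step 2
    masks = prune_by_magnitude(model, masks, s=0.2)   # Step 3: m' with P_m' = (P_m - s)%
    model.load_state_dict(theta_0)                    # Step 4: reset weights to θ_0
    apply_mask(model, masks)                          # Step 5: m ← m', then repeat
```

Note that the mask is re-applied after every optimizer step so that pruned weights stay at zero throughout training; a faithful reproduction would use the real MNIST training set and the paper's hyperparameters rather than the dummy data used here to keep the sketch self-contained.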
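
The dataset-split and experiment-setup rows can likewise be sketched as a training configuration. Only the 55,000/5,000 split, the mini-batch size of 60 with per-epoch shuffling, the Adam optimizer, and Gaussian Glorot (Xavier normal) initialization come from the quotes above; the torchvision data pipeline, the Lenet-300-100-style layer sizes, and PyTorch's default Adam learning rate are assumptions. The batch size of 128, batch normalization, and weight decay of 0.0001 from the last quote belong to a different configuration than the fully-connected MNIST setup sketched here and are omitted.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# "We randomly sampled a 5,000-example validation set from the training set and
#  used the remaining 55,000 training examples as our training set."
full_train = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
train_set, val_set = random_split(full_train, [55_000, 5_000])

# Fully-connected network; the Lenet-300-100-style layer sizes are an assumption.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# "Gaussian Glorot initialization" for the weights; zero-initialized biases are an assumption.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_normal_(layer.weight)
        nn.init.zeros_(layer.bias)

# "We use the Adam optimizer"; the learning rate is left at PyTorch's default here.
optimizer = torch.optim.Adam(model.parameters())

# "mini-batches of 60 examples; at each epoch, the entire training set is shuffled"
train_loader = DataLoader(train_set, batch_size=60, shuffle=True)
```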