The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Authors: Jonathan Frankle, Michael Carbin
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. |
| Researcher Affiliation | Academia | Jonathan Frankle MIT CSAIL jfrankle@csail.mit.edu Michael Carbin MIT CSAIL mcarbin@csail.mit.edu |
| Pseudocode | Yes | Strategy 1: Iterative pruning with resetting. 1. Randomly initialize a neural network f(x; m ⊙ θ) where θ = θ_0 and m = 1^{\|θ\|} is a mask. 2. Train the network for j iterations, reaching parameters m ⊙ θ_j. 3. Prune s% of the parameters, creating an updated mask m' where P_m' = (P_m - s)%. 4. Reset the weights of the remaining portion of the network to their values in θ_0. That is, let θ = θ_0. 5. Let m = m' and repeat steps 2 through 4 until a sufficiently pruned network has been obtained. (A runnable sketch of this procedure appears after the table.) |
| Open Source Code | No | The paper does not state that the authors' source code is available, nor does it link to a code repository for the methodology. |
| Open Datasets | Yes | We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. |
| Dataset Splits | Yes | We randomly sampled a 5,000-example validation set from the training set and used the remaining 55,000 training examples as our training set for the rest of the paper (including Section 2). |
| Hardware Specification | No | "We gratefully acknowledge IBM, which through the MIT-IBM Watson AI Lab contributed the computational resources necessary to conduct the experiments in this paper." (This does not specify exact hardware models or configurations.) |
| Software Dependencies | No | The paper mentions software components and optimizers (e.g., 'Adam optimizer', 'SGD', 'dropout', 'batchnorm'), but does not provide specific version numbers for any of them (e.g., 'PyTorch 1.9', 'TensorFlow 2.x'). |
| Experiment Setup | Yes | "The training set is presented to the network in mini-batches of 60 examples; at each epoch, the entire training set is shuffled." and "We use the Adam optimizer (Kingma & Ba, 2014) and Gaussian Glorot initialization (Glorot & Bengio, 2010)." and "We use a batch size of 128. We use batch normalization. We use weight decay of 0.0001." (A minimal configuration sketch of this setup follows the table.) |
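
Below is a minimal PyTorch sketch of Strategy 1 (iterative pruning with resetting) from the pseudocode row above. It is not the authors' implementation: `make_model` and `train_fn` are hypothetical helpers (build a freshly initialized network; train it for j iterations while holding masked weights at zero), and layer-wise magnitude pruning is assumed as the pruning criterion, following the paper's description.

```python
import copy
import torch

def magnitude_prune(masks, weights, rate):
    """Prune `rate` of each layer's surviving weights by smallest magnitude."""
    new_masks = {}
    for name, w in weights.items():
        mask = masks[name]
        surviving = w[mask.bool()].abs()
        k = int(rate * surviving.numel())
        if k == 0:
            new_masks[name] = mask.clone()
            continue
        threshold = surviving.sort().values[k - 1]
        # Keep only surviving weights above the per-layer magnitude threshold.
        new_masks[name] = mask * (w.abs() > threshold).float()
    return new_masks


def iterative_prune_with_resetting(make_model, train_fn, prune_rate=0.2, rounds=5):
    """Strategy 1 from the pseudocode row: iterative pruning with resetting.

    `make_model` and `train_fn` are hypothetical helpers: `make_model()` builds
    and randomly initializes the network f(x; m ⊙ θ); `train_fn(model, masks)`
    trains it for j iterations while holding masked weights at zero.
    """
    model = make_model()
    theta0 = copy.deepcopy(model.state_dict())                 # remember θ_0
    masks = {name: torch.ones_like(p)                          # m = all-ones mask
             for name, p in model.named_parameters() if "weight" in name}

    for _ in range(rounds):
        train_fn(model, masks)                                 # reach m ⊙ θ_j
        weights = {name: p.detach().clone()
                   for name, p in model.named_parameters() if name in masks}
        masks = magnitude_prune(masks, weights, prune_rate)    # m -> m'
        model.load_state_dict(theta0)                          # reset θ to θ_0
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])                        # apply m' to θ_0
    return model, masks
```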
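
The quoted setup mixes two configurations: mini-batches of 60 with Adam and Gaussian Glorot initialization (the fully-connected MNIST network), and a batch size of 128 with batch normalization and weight decay of 0.0001 (the convolutional CIFAR10 networks). A minimal sketch of the first, assuming the paper's 300-100 fully-connected architecture; the learning rate is not quoted above, so PyTorch's default stands in as a placeholder.

```python
import torch
import torch.nn as nn

def gaussian_glorot_init(module):
    """Gaussian Glorot (Xavier-normal) initialization, as quoted above."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

# Fully-connected MNIST network (the paper's Lenet-style 300-100 architecture).
lenet = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
lenet.apply(gaussian_glorot_init)

# Adam optimizer and mini-batches of 60 shuffled examples per epoch, as quoted.
# The batch size of 128, batch normalization, and weight decay of 0.0001 quoted
# above belong to the convolutional CIFAR10 setup and are not used here.
optimizer = torch.optim.Adam(lenet.parameters())  # learning rate: assumed default
batch_size = 60
```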