Pruning Neural Networks at Initialization: Why Are We Missing the Mark?

Authors: Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, we find that they remain below the accuracy of magnitude pruning after training. We show that, unlike magnitude pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both. (The per-weight scoring and the shuffling ablation are illustrated in the sketches below the table.) |
| Researcher Affiliation | Collaboration | Jonathan Frankle (MIT CSAIL), Gintare Karolina Dziugaite (Element AI), Daniel M. Roy (University of Toronto; Vector Institute), Michael Carbin (MIT CSAIL) |
| Pseudocode | No | The paper refers to 'Algorithm 2 of the paper' (Wang et al., 2020) for GraSP implementation details but includes no pseudocode or algorithm blocks of its own. |
| Open Source Code | No | The paper points to other researchers' code, namely 'the GitHub implementation of GraSP by Wang et al. (2020)' and 'a GitHub repository associated with the paper' (Lee et al., 2019), but it makes no statement about releasing source code for its own methodology. |
| Open Datasets | Yes | We use ResNet-20 and VGG-16 on CIFAR-10, ResNet-18 on Tiny ImageNet, and ResNet-50 on ImageNet. See Appendix A for hyperparameters. |
| Dataset Splits | No | The paper uses standard datasets (CIFAR-10, Tiny ImageNet, and ImageNet) but does not explicitly describe the train/validation/test splits it uses (e.g., percentages, sample counts, or the methodology for partitioning the data). |
| Hardware Specification | No | The paper mentions running computations on 'CPU' and training on 'TPU', but it gives no specific hardware details such as exact CPU or TPU models (e.g., 'Intel Core i7' or 'TPU v2'). |
| Software Dependencies | No | The paper mentions using 'PyTorch' for its implementation and 'TorchVision' for network implementations, but it does not specify version numbers for these or for any other ancillary software. |
| Experiment Setup | Yes | Appendix A.3 provides a detailed 'TRAINING' table for each network and dataset, specifying 'Epochs', 'Batch', 'Opt.', 'Mom.', 'LR', 'LR Drop', 'Weight Decay', 'Initialization', 'Iters per Ep', and 'Rewind Iter' hyperparameters. |
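
The paper itself contains no pseudocode (see the Pseudocode row), but the per-weight pruning decisions it evaluates are easy to illustrate. Below is a minimal PyTorch-style sketch of SNIP-style scoring as described by Lee et al. (2019): score each weight by |w · ∂L/∂w| on one mini-batch and keep the highest-scoring fraction of weights globally. The function names and structure here are ours, for illustration only; they are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def snip_saliency(model, inputs, targets):
    """Per-weight saliency |w * dL/dw| on one mini-batch (SNIP-style).
    Returns one score tensor per weight matrix/kernel; biases and
    normalization parameters are skipped."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    loss = F.cross_entropy(model(inputs), targets)
    grads = torch.autograd.grad(loss, weights)
    return [(w * g).abs() for w, g in zip(weights, grads)]

def global_prune_masks(scores, sparsity):
    """Binary masks keeping the top (1 - sparsity) fraction of weights
    across all layers; the lowest-saliency weights are pruned."""
    flat = torch.cat([s.flatten() for s in scores])
    k = int((1 - sparsity) * flat.numel())
    threshold = torch.topk(flat, k).values.min()
    return [(s >= threshold).float() for s in scores]
```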
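
The central ablation quoted in the Research Type row replaces each method's per-weight choices with a per-layer sparsity fraction. Here is a minimal sketch of the two interventions, assuming a binary 0/1 mask per layer; again the names are illustrative, and the Kaiming-normal reinitialization is an assumed stand-in for the per-network initializers the paper specifies in Appendix A.

```python
import torch

def shuffle_mask(mask):
    """Randomly permute one layer's binary pruning mask: the layer's
    sparsity fraction is preserved, but the specific per-weight
    pruning decisions are discarded."""
    flat = mask.flatten()
    perm = torch.randperm(flat.numel(), device=flat.device)
    return flat[perm].reshape(mask.shape)

def reinit_surviving_weights(weight, mask):
    """Sample new initial values for the unpruned weights (the
    abstract's 'sampling new initial values' ablation). The
    initializer here is an assumption, not the paper's choice."""
    fresh = torch.empty_like(weight)
    torch.nn.init.kaiming_normal_(fresh)
    return fresh * mask
```

Per the paper's findings, applying `shuffle_mask` layer by layer to masks produced by SNIP, GraSP, or SynFlow preserves or improves accuracy, which is why the authors conclude that only each layer's pruning fraction, not the identity of the pruned weights, matters for these methods at initialization.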