On the Predictability of Pruning Across Scales

Authors: Jonathan S Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of magnitude in depth, width, dataset size, and density.
Researcher Affiliation | Academia | MIT CSAIL. Correspondence to: Jonathan Rosenfeld <jonsr@csail.mit.edu>.
Pseudocode | Yes | For a formal statement of this pruning algorithm, see Appendix A.
Open Source Code | No | The paper does not provide a direct link to open-source code for its methodology, nor does it explicitly state that code will be released.
Open Datasets | Yes | In the main body of the paper, we study the image classification tasks CIFAR-10 and ImageNet. Our scaling law predicts the error when training with the entire dataset and smaller subsamples. ... To subsample a dataset to a size of n, we randomly select n of the training examples without regard to individual classes such that in expectation we preserve the original dataset distribution (we always retain the entire test set). (A code sketch of this subsampling step appears after the table.)
Dataset Splits | No | The paper mentions training data, subsamples, and retaining the test set. It describes training three replicates with different seeds. However, it does not explicitly describe a separate validation split or how it was used in the experimental setup.
Hardware Specification | No | The paper mentions 'TPU resources' and 'GPU resources' provided by Google and IBM respectively, but it does not specify the exact models or configurations of these hardware components (e.g., specific TPU versions such as v2/v3, or GPU models such as V100/A100).
Software Dependencies | No | The paper does not provide specific software names with version numbers that would be necessary for reproduction.
Experiment Setup | Yes | We study iterative magnitude pruning (IMP)... IMP prunes by removing a fraction (typically 20%, as we do here) of individual weights with the lowest magnitudes... For IMP, we use a practice called weight rewinding... in which the values of unpruned weights are rewound to their values earlier in training (in our case, epoch 10) and the training process is repeated from there to completion. ... To achieve density levels below 80%, this process is repeated iteratively (pruning by 20%, rewinding, and retraining) until a desired density level is reached. ... See Appendix B for full details on architectures and hyperparameters. (A sketch of IMP with rewinding appears after the table.)
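
The "Open Datasets" row quotes the paper's subsampling rule: select n training examples at random without regard to class, so the class distribution is preserved only in expectation, and never subsample the test set. Below is a minimal sketch of that step, assuming a PyTorch-style Dataset; the function name, seed handling, and use of torch.utils.data.Subset are illustrative choices, not taken from the authors' (unreleased) code.

```python
# Minimal sketch of the paper's dataset-subsampling step.
# Names and seed handling are illustrative assumptions.
import numpy as np
from torch.utils.data import Dataset, Subset


def subsample_train_set(train_set: Dataset, n: int, seed: int = 0) -> Subset:
    """Randomly keep n training examples, ignoring class labels, so the
    class distribution is preserved only in expectation.
    The test set is never subsampled."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(train_set), size=n, replace=False)
    return Subset(train_set, keep.tolist())
```

For example, `subsample_train_set(cifar10_train, n=len(cifar10_train) // 4)` keeps a quarter of the CIFAR-10 training set while the full test set is retained, matching the quoted procedure.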
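
The "Experiment Setup" row quotes the paper's description of iterative magnitude pruning (IMP) with weight rewinding. The following is a minimal PyTorch-style sketch of that loop under stated assumptions: the `train_to_completion` callback (which must apply the masks during training and return the epoch-10 state dict), the choice of prunable layers, and the global magnitude ranking are illustrative and not taken from the paper, whose formal statement and hyperparameters are in Appendices A and B.

```python
# Illustrative sketch of IMP with weight rewinding; not the authors' code.
import torch
import torch.nn as nn


def prunable_weights(model: nn.Module):
    # Assumption: only conv/linear weight tensors are pruned; biases and
    # normalization parameters stay dense.
    return [m.weight for m in model.modules()
            if isinstance(m, (nn.Conv2d, nn.Linear))]


def imp_with_rewinding(model, train_to_completion, target_density,
                       prune_fraction=0.2, rewind_epoch=10):
    """Iteratively prune the lowest-magnitude surviving weights (20% per
    iteration), rewind survivors to their epoch-10 values, and retrain,
    until the overall density reaches `target_density`."""
    weights = prunable_weights(model)
    masks = [torch.ones_like(w) for w in weights]

    # Initial training run; the (hypothetical) callback trains to completion
    # under the given masks and returns the epoch-`rewind_epoch` state dict.
    rewind_state = train_to_completion(model, masks, save_at_epoch=rewind_epoch)

    density = 1.0
    while density > target_density:
        # Rank surviving weights by magnitude (global ranking is an
        # assumption; the excerpt does not say whether pruning is per layer).
        surviving = torch.cat([w[m.bool()].abs()
                               for w, m in zip(weights, masks)])
        k = max(1, int(prune_fraction * surviving.numel()))
        threshold = torch.kthvalue(surviving, k).values
        masks = [m * (w.abs() > threshold).float()
                 for w, m in zip(weights, masks)]

        # Weight rewinding: restore epoch-10 values (pruned positions stay
        # zeroed by `masks` during retraining), then retrain to completion.
        model.load_state_dict(rewind_state)
        train_to_completion(model, masks, save_at_epoch=None)

        density = (sum(m.sum().item() for m in masks)
                   / sum(m.numel() for m in masks))
    return model, masks
```

In this sketch each iteration multiplies the density by 0.8, so reaching, say, 10% density takes roughly ten prune-rewind-retrain rounds (0.8^10 ≈ 0.11).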