Why Random Pruning Is All We Need to Start Sparse

Authors: Advait Harshal Gadhikar, Sohom Mukherjee, Rebekka Burkholz

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We offer a theoretical explanation of how random masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity 1/log(1/sparsity)... We demonstrate the feasibility of this approach in experiments for different pruning methods and propose particularly effective choices of initial layer-wise sparsity ratios of the random source network.
Researcher Affiliation | Academia | CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. Correspondence to: Advait Gadhikar <advait.gadhikar@cispa.de>.
Pseudocode | No | The paper describes algorithms and procedures in prose and mathematical notation but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Our code is available at https://github.com/RelationalML/sparse_to_sparse.
Open Datasets | Yes | We conduct our experiments with two datasets built for image classification tasks: CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) and Tiny ImageNet (Russakovsky et al., 2015b).
Dataset Splits | Yes | We use the validation set provided by the creators of Tiny ImageNet (Russakovsky et al., 2015b) as a test set to measure the generalization performance of our trained models.
Hardware Specification | Yes | All our experiments were run with 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using Stochastic Gradient Descent (SGD) and the Adam optimizer, and states that its code builds on work by Liu et al. (2021), Tanaka et al. (2020), and Kusupati et al. (2020). However, it does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | Each model is trained using Stochastic Gradient Descent (SGD) with learning rate 0.1 and momentum 0.9 with weight decay 0.0005 and batch size 128. We use the same hyperparameters as (Ma et al., 2021) and train every model for 160 epochs.
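
The quoted abstract and the linked repository both revolve around starting from a randomly pruned source network with chosen layer-wise sparsity ratios. The sketch below shows one minimal way such random masks could be drawn in PyTorch; the helper names (`random_mask`, `apply_random_pruning`) and the per-layer density dictionary are illustrative assumptions, not the actual API of the sparse_to_sparse repository.

```python
import torch
import torch.nn as nn


def random_mask(weight: torch.Tensor, density: float) -> torch.Tensor:
    """Draw a binary mask keeping roughly `density` of the entries uniformly at random."""
    num_params = weight.numel()
    num_keep = max(1, int(round(density * num_params)))
    keep_idx = torch.randperm(num_params, device=weight.device)[:num_keep]
    mask = torch.zeros(num_params, device=weight.device)
    mask[keep_idx] = 1.0
    return mask.view_as(weight)


def apply_random_pruning(model: nn.Module, layer_densities: dict) -> dict:
    """Zero out selected layers according to per-layer random masks (hypothetical helper)."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)) and name in layer_densities:
            mask = random_mask(module.weight.data, layer_densities[name])
            module.weight.data.mul_(mask)  # start sparse: prune at initialization
            masks[name] = mask
    return masks
```

To keep the network sparse during training, the same masks would be re-applied to the weights (or their gradients) after every optimizer step; the paper compares several choices of the layer-wise sparsity ratios rather than any particular dictionary of ratios assumed here.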
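The Open Datasets and Dataset Splits rows above describe standard benchmarks, with the Tiny ImageNet validation split used as the test set. A minimal torchvision loading sketch under that reading follows; the ./data and ./tiny-imagenet-200 paths, and the assumption that the Tiny ImageNet validation images have been rearranged into per-class folders, are illustrative and not taken from the paper or its code.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR10 and CIFAR100 ship with fixed train/test splits in torchvision.
cifar10_train = datasets.CIFAR10("./data", train=True, download=True, transform=to_tensor)
cifar10_test = datasets.CIFAR10("./data", train=False, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100("./data", train=True, download=True, transform=to_tensor)
cifar100_test = datasets.CIFAR100("./data", train=False, download=True, transform=to_tensor)

# Tiny ImageNet is distributed as image folders; following the quoted split, its
# provided validation set serves as the test set. This assumes the validation images
# have been reorganized into one subfolder per class, which the raw download does not do.
tiny_train = datasets.ImageFolder("./tiny-imagenet-200/train", transform=to_tensor)
tiny_test = datasets.ImageFolder("./tiny-imagenet-200/val", transform=to_tensor)
```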
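The Experiment Setup row quotes the optimizer settings directly (SGD, learning rate 0.1, momentum 0.9, weight decay 0.0005, batch size 128, 160 epochs). The sketch below wires those numbers into a PyTorch training loop; the ResNet-18 architecture, the CIFAR10 loader, the augmentation choices, and the absence of a learning-rate schedule are placeholders rather than the paper's exact pipeline, which follows Ma et al. (2021).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hyperparameters quoted in the Experiment Setup row.
LR, MOMENTUM, WEIGHT_DECAY, BATCH_SIZE, EPOCHS = 0.1, 0.9, 5e-4, 128, 160

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # standard CIFAR augmentation (assumed, not quoted)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_transform)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)

model = models.resnet18(num_classes=10)  # placeholder model, not necessarily the paper's architecture
optimizer = torch.optim.SGD(model.parameters(), lr=LR,
                            momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

In the sparse-training setting of the paper, the random masks from the first sketch would be re-applied to the weights after each optimizer.step() call so that pruned connections remain zero throughout training.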