Why Random Pruning Is All We Need to Start Sparse
Authors: Advait Harshal Gadhikar, Sohom Mukherjee, Rebekka Burkholz
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We offer a theoretical explanation of how random masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity 1/log(1/sparsity)... We demonstrate the feasibility of this approach in experiments for different pruning methods and propose particularly effective choices of initial layer-wise sparsity ratios of the random source network. |
| Researcher Affiliation | Academia | CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. Correspondence to: Advait Gadhikar <advait.gadhikar@cispa.de>. |
| Pseudocode | No | The paper describes algorithms and procedures in prose and mathematical notation but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/RelationalML/sparse_to_sparse. |
| Open Datasets | Yes | We conduct our experiments with three datasets built for image classification tasks: CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) and Tiny ImageNet (Russakovsky et al., 2015b). |
| Dataset Splits | Yes | We use the validation set provided by the creators of Tiny ImageNet (Russakovsky et al., 2015b) as a test set to measure the generalization performance of our trained models. |
| Hardware Specification | Yes | All our experiments were run with 4 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using Stochastic Gradient Descent (SGD) and Adam optimizer, and states that its code builds on work by Liu et al. (2021), Tanaka et al. (2020), and Kusupati et al. (2020). However, it does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Each model is trained using Stochastic Gradient Descent (SGD) with learning rate 0.1 and momentum 0.9 with weight decay 0.0005 and batch size 128. We use the same hyperparameters as (Ma et al., 2021) and train every model for 160 epochs. |
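The Research Type row summarizes the paper's core idea: start from a randomly drawn sparse mask instead of pruning a trained dense network. The sketch below illustrates what drawing such a mask could look like in PyTorch. It is a hedged illustration, not the authors' implementation: the helper name `random_prune_`, the uniform per-layer sparsity, and the decision to skip one-dimensional parameters are assumptions for clarity; the paper additionally studies particular layer-wise sparsity ratios.

```python
# Hedged sketch (not the authors' code): draw a random, layer-wise binary mask
# at a fixed sparsity for a PyTorch model and prune it at initialization.
import torch
import torch.nn as nn


def random_prune_(model: nn.Module, sparsity: float) -> dict:
    """Attach a random binary mask to every weight tensor, zeroing a
    `sparsity` fraction of entries uniformly at random per layer."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:  # skip biases and normalization parameters
            continue
        # Keep each weight independently with probability (1 - sparsity).
        mask = (torch.rand_like(param) > sparsity).float()
        param.data.mul_(mask)   # start sparse: prune before any training
        masks[name] = mask      # keep masks to reapply after optimizer steps
    return masks


# Usage: a small MLP pruned to 90% sparsity at initialization.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
masks = random_prune_(model, sparsity=0.9)
```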
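The Experiment Setup row quotes concrete optimizer hyperparameters. A minimal training-loop sketch with those settings is given below; only the SGD configuration (learning rate 0.1, momentum 0.9, weight decay 0.0005, batch size 128, 160 epochs) comes from the quoted text, while the ResNet-18 architecture, the plain `ToTensor` transform, and the absence of a learning-rate schedule are assumptions made to keep the example self-contained.

```python
# Hedged sketch of the quoted training setup on CIFAR-10: SGD with lr 0.1,
# momentum 0.9, weight decay 5e-4, batch size 128, 160 epochs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor(),  # augmentation omitted; an assumption
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

model = models.resnet18(num_classes=10).to(device)  # architecture assumed
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)

for epoch in range(160):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```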