Sparsest Models Elude Pruning: An Exposé of Pruning’s Current Capabilities
Authors: Stephen Zhang, Vardan Papyan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted an extensive series of 485,838 experiments, applying a range of state-of-the-art pruning algorithms to a synthetic dataset we created, named the Cubist Spiral. |
| Researcher Affiliation | Academia | Stephen Zhang¹, Vardan Papyan¹. ¹Department of Mathematics, University of Toronto, Toronto, Canada. Correspondence to: Stephen Zhang <stephenn.zhang@mail.utoronto.ca>. |
| Pseudocode | Yes | Algorithm 1: Combinatorial Search |
| Open Source Code | Yes | Through an empirical study (code available on GitHub), we uncover the following deficiencies: |
| Open Datasets | No | A synthetic dataset named the Cubist Spiral, depicted in Figure 2b. The simplicity inherent in the dataset leads to interpretable sparse models that are amenable to visualization and analysis. |
| Dataset Splits | No | We pick 50,000 points spaced evenly along the spiral divided equally between the two classes. This deliberate choice of a large training set stems from our desire to separate any issues related to generalization when evaluating the efficacy of pruning algorithms. |
| Hardware Specification | No | This research was enabled in part by support provided by Compute Ontario (https://www.computeontario.ca) and the Digital Research Alliance of Canada (https://alliancecan.ca/en). |
| Software Dependencies | No | The first thread uses PyTorch to produce the input variables, post-activation states, and model output by inputting a 512 × 512 grid of evenly spaced points in the square [−2.25, 2.25] × [−2.25, 2.25]. It saves these tensors as well as the model parameters as global variables. (A hedged sketch of this grid evaluation appears below the table.) |
| Experiment Setup | Yes | We train the model parameters for 50 epochs using stochastic gradient descent (SGD) with momentum 0.9 and a batch size of 128. Parameters outside of the determined model mask are constrained to be zero. A weight decay is applied for all experiments and set to 5e-4. For the pruning experiments, learning rates {0.05, 0.1, 0.2} are used while for the combinatorial search, only {0.05, 0.1} are used. We also utilize three learning rate schedulers: constant learning rate, a cosine annealing scheduler, and a decay of 0.1 applied at epochs 15 and 30. (A hedged sketch of this training configuration appears below the table.) |
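
The Software Dependencies row quotes the paper's description of evaluating the model on a dense grid of inputs. The sketch below illustrates one plausible way to build that 512 × 512 grid over [−2.25, 2.25] × [−2.25, 2.25] and record inputs, post-activation states, and outputs in PyTorch; the two-layer ReLU model is a hypothetical stand-in, not the paper's exact architecture.

```python
import torch

# Build a 512 x 512 grid of evenly spaced points covering [-2.25, 2.25] x [-2.25, 2.25].
side = torch.linspace(-2.25, 2.25, 512)
xx, yy = torch.meshgrid(side, side, indexing="ij")
grid = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=1)  # (512 * 512, 2) input variables

# Placeholder two-layer network; the paper's actual model may differ.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)

with torch.no_grad():
    post_activations = model[1](model[0](grid))  # hidden states after the ReLU
    outputs = model(grid)                        # model output over the grid

# These tensors, together with model.state_dict(), could then be stored as the
# global variables mentioned in the excerpt.
```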
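
The Experiment Setup row specifies the optimizer, batch size, weight decay, schedulers, and the constraint that parameters outside the pruning mask stay zero. The following is a minimal sketch of such a loop, assuming a placeholder model, an all-ones mask, and random stand-in data rather than the paper's Cubist Spiral training set.

```python
import torch

# Placeholder model and pruning mask (all ones here; the paper uses searched/pruned masks).
model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
mask = {name: torch.ones_like(p) for name, p in model.named_parameters()}

# Random stand-in for the 50,000-point training set described in the Dataset Splits row.
train_set = torch.utils.data.TensorDataset(torch.randn(50_000, 2), torch.randint(0, 2, (50_000,)))
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# SGD with momentum 0.9 and weight decay 5e-4, as quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# One of the three quoted schedulers; the others are a constant learning rate and
# torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 30], gamma=0.1).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        # Constrain parameters outside the determined model mask to remain zero.
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.mul_(mask[name])
    scheduler.step()
```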