Sparsest Models Elude Pruning: An Exposé of Pruning’s Current Capabilities

Authors: Stephen Zhang, Vardan Papyan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted an extensive series of 485,838 experiments, applying a range of state-of-the-art pruning algorithms to a synthetic dataset we created, named the Cubist Spiral."
Researcher Affiliation | Academia | "Stephen Zhang, Vardan Papyan. Department of Mathematics, University of Toronto, Toronto, Canada. Correspondence to: Stephen Zhang <stephenn.zhang@mail.utoronto.ca>."
Pseudocode | Yes | "Algorithm 1: Combinatorial Search" (see the combinatorial-search sketch after the table)
Open Source Code | Yes | "Through an empirical study (code available on GitHub), we uncover the following deficiencies:"
Open Datasets | No | "A synthetic dataset named the Cubist Spiral, depicted in Figure 2b. The simplicity inherent in the dataset leads to interpretable sparse models that are amenable to visualization and analysis."
Dataset Splits | No | "We pick 50,000 points spaced evenly along the spiral, divided equally between the two classes. This deliberate choice of a large training set stems from our desire to separate out any issues related to generalization when evaluating the efficacy of pruning algorithms." (see the dataset-generation sketch after the table)
Hardware Specification | No | "This research was enabled in part by support provided by Compute Ontario (https://www.computeontario.ca) and the Digital Research Alliance of Canada (https://alliancecan.ca/en)."
Software Dependencies | No | "The first thread uses PyTorch to produce the input variables, post-activation states, and model output by inputting a 512 × 512 grid of evenly spaced points in the square [-2.25, 2.25] × [-2.25, 2.25]. It saves these tensors as well as the model parameters as global variables." (see the grid-evaluation sketch after the table)
Experiment Setup | Yes | "We train the model parameters for 50 epochs using stochastic gradient descent (SGD) with momentum 0.9 and a batch size of 128. Parameters outside of the determined model mask are constrained to be zero. A weight decay is applied for all experiments and set to 5e-4. For the pruning experiments, learning rates {0.05, 0.1, 0.2} are used, while for the combinatorial search, only {0.05, 0.1} are used. We also utilize three learning rate schedulers: a constant learning rate, a cosine annealing scheduler, and a decay of 0.1 applied at epochs 15 and 30." (see the training-loop sketch after the table)
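
Combinatorial-search sketch. The paper's Algorithm 1 is only named in the excerpt above, not reproduced here. The sketch below is a minimal illustration of the general idea, assuming the search enumerates binary weight masks of a fixed support size and trains the masked network on each candidate; the helper `train_and_evaluate` is hypothetical and stands in for whatever training routine the paper uses.

```python
# Hedged sketch of a combinatorial mask search; not the paper's Algorithm 1.
from itertools import combinations
import torch

def combinatorial_search(n_weights, n_nonzero, train_and_evaluate):
    """Exhaustively try every support of size `n_nonzero` over `n_weights` weights."""
    best_mask, best_loss = None, float("inf")
    for support in combinations(range(n_weights), n_nonzero):
        mask = torch.zeros(n_weights)
        mask[list(support)] = 1.0
        # `train_and_evaluate` (hypothetical) trains the network with weights
        # outside `mask` fixed at zero and returns the resulting training loss.
        loss = train_and_evaluate(mask)
        if loss < best_loss:
            best_mask, best_loss = mask, loss
    return best_mask, best_loss
```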
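Dataset-generation sketch. The exact parametrization of the Cubist Spiral is not given in the excerpts above, so the following only illustrates a generic two-class spiral with 50,000 points split evenly between the classes; the number of turns, the noise level, and the even spacing in the curve parameter (rather than arc length) are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch of a generic two-class spiral dataset (not the Cubist Spiral itself).
import numpy as np

def make_spiral(n_points=50_000, turns=2.0, noise=0.0, seed=0):
    """Two interleaved spiral arms, one per class, sampled evenly in the curve parameter."""
    rng = np.random.default_rng(seed)
    n_per_class = n_points // 2
    t = np.linspace(0.0, turns * 2 * np.pi, n_per_class)  # evenly spaced parameter
    r = t / (turns * 2 * np.pi)                            # radius grows from 0 to 1
    x0 = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)                      # class 0 arm
    x1 = np.stack([r * np.cos(t + np.pi), r * np.sin(t + np.pi)], axis=1)      # class 1 arm, rotated by pi
    X = np.concatenate([x0, x1])
    X = X + noise * rng.standard_normal(X.shape)
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)]).astype(np.int64)
    return X.astype(np.float32), y
```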
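Grid-evaluation sketch. A minimal sketch of the evaluation quoted under Software Dependencies, assuming a standard PyTorch model mapping 2-D inputs to class logits; recording post-activation states would additionally require forward hooks, which are omitted here, and the function name is ours, not the paper's.

```python
# Hedged sketch: evaluate a model on a 512 x 512 grid covering [-2.25, 2.25]^2.
import torch

def evaluate_on_grid(model, resolution=512, extent=2.25):
    xs = torch.linspace(-extent, extent, resolution)
    ys = torch.linspace(-extent, extent, resolution)
    gx, gy = torch.meshgrid(xs, ys, indexing="ij")
    grid = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)  # shape (resolution**2, 2)
    with torch.no_grad():
        out = model(grid)
    return grid, out.reshape(resolution, resolution, -1)
```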
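Training-loop sketch. A minimal sketch of the setup quoted under Experiment Setup: 50 epochs of SGD with momentum 0.9, batch size 128, weight decay 5e-4, and one of the three schedulers. The convention of one binary mask per parameter tensor, re-applied after every optimizer step, is an assumption; the paper's implementation may enforce the mask differently.

```python
# Hedged sketch of masked training under the quoted hyperparameters.
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_masked(model, masks, X, y, lr=0.1, epochs=50, scheduler="cosine"):
    loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    if scheduler == "cosine":
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    elif scheduler == "step":
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[15, 30], gamma=0.1)
    else:
        sched = None  # constant learning rate
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
            with torch.no_grad():  # keep parameters outside the mask at zero
                for p, m in zip(model.parameters(), masks):
                    p.mul_(m)
        if sched is not None:
            sched.step()
    return model
```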