Scaling Laws for Sparsely-Connected Foundation Models

Authors: Elias Frantar, Carlos Riquelme Ruiz, Neil Houlsby, Dan Alistarh, Utku Evci

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate empirically across model and data scales; on ViT/JFT-4B and T5/C4. These results allow us to characterize the optimal sparsity, the sparsity level which yields the best performance for a given effective model size and training budget. We execute extensive training sweeps across sparsity, size and data, which we then subsequently use to develop scaling laws. Now follows a very brief summary of our main setup; a detailed discussion of all our choices, including the experiment grid and hyperparameters, can be found in Appendix A. (An illustrative scaling-law fitting sketch follows the table.)
Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 IST Austria; correspondence: elias.frantar@ist.ac.at
Pseudocode | Yes | Algorithm 1: Prune weights w to sparsity s = 1 - n/m, where each group of m weights contains at most n non-zeros. (An illustrative pruning sketch follows the table.)
Open Source Code | Yes | We provide pruning and scaling law fitting code at: github.com/google-research/jaxpruner/tree/main/jaxpruner/projects/bigsparse.
Open Datasets | Yes | We use the massive JFT-4B (Google, 2023a) and C4 (Raffel et al., 2020a) datasets, which are several orders of magnitude larger than what has been employed so far by the vast majority of work on sparsity.
Dataset Splits | No | The paper mentions using "validation loss" and conducting training sweeps, but it does not specify the explicit splits (e.g., percentages or counts) for the training, validation, or test datasets used in the experiments.
Hardware Specification | No | The paper discusses hardware-friendly sparsity patterns and specialized hardware/software for acceleration (e.g., "hardware-friendly n:m patterns", "custom hardware", "newer generations of NVIDIA GPUs"), but it does not specify the actual hardware (e.g., specific GPU or CPU models, memory, or cloud instances) used to run its experiments.
Software Dependencies | No | The paper mentions the "Jaxpruner library (Lee et al., 2023)" and "Adafactor-based (Shazeer & Stern, 2018) optimizers". While these are specific software components, the paper does not provide version numbers for them or for any other software dependencies, such as the underlying deep learning framework (e.g., JAX, TensorFlow, PyTorch).
Experiment Setup | Yes | We execute extensive training sweeps across sparsity, size and data, which we then subsequently use to develop scaling laws. Now follows a very brief summary of our main setup; a detailed discussion of all our choices, including the experiment grid and hyperparameters, can be found in Appendix A. In terms of specific hyper-parameters, we prune using a cubic schedule starting after 25% of training and ending at 75%, updating every 100 steps. (An illustrative schedule sketch follows the table.)
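
The Research Type row mentions developing scaling laws and characterizing the optimal sparsity, and the repository linked in the Open Source Code row ships scaling-law fitting code. The snippet below is only a minimal sketch of that kind of curve fitting, assuming a simple saturating power law in model size; the paper's actual law jointly models sparsity, model size, and data, and the data points here are invented purely for illustration.

    # Illustrative scaling-law fit (not the paper's functional form, which
    # jointly models sparsity, size and data): fit L(N) = a * N**(-b) + c
    # to (parameter count, validation loss) pairs with scipy.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b, c):
        return a * n ** (-b) + c

    params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])       # hypothetical model sizes
    losses = np.array([3.10, 2.85, 2.62, 2.45, 2.33])  # hypothetical val losses

    (a, b, c), _ = curve_fit(power_law, params, losses,
                             p0=[10.0, 0.2, 2.0], maxfev=10_000)
    print(f"fitted: a={a:.3g}, b={b:.3g}, c={c:.3g}")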
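
The Algorithm 1 caption quoted in the Pseudocode row describes n:m magnitude pruning. Below is a minimal NumPy sketch of that idea (keep the n largest-magnitude weights in every group of m, giving sparsity s = 1 - n/m); it is not the paper's jaxpruner implementation, and the function name and reshape-based grouping are choices made here for illustration.

    # Minimal NumPy sketch of n:m magnitude pruning: in every group of m
    # consecutive weights, keep the n largest-magnitude entries and zero
    # the rest, yielding sparsity s = 1 - n/m.
    import numpy as np

    def prune_n_m(w: np.ndarray, n: int, m: int) -> np.ndarray:
        flat = w.reshape(-1, m)                       # group weights in blocks of m
        # indices of the (m - n) smallest-magnitude weights in each group
        drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
        mask = np.ones_like(flat, dtype=bool)
        np.put_along_axis(mask, drop, False, axis=1)  # zero out dropped positions
        return (flat * mask).reshape(w.shape)

    # Example: 2:4 sparsity (s = 1 - 2/4 = 0.5) on a small weight matrix.
    w = np.random.randn(8, 8)
    w_sparse = prune_n_m(w, n=2, m=4)
    assert np.count_nonzero(w_sparse) == w.size // 2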
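
The Experiment Setup row states that pruning follows a cubic schedule running from 25% to 75% of training, updated every 100 steps. The sketch below assumes the common cubic ramp s_t = s_final * (1 - (1 - progress)^3); the exact schedule and hyperparameters used in the paper are specified in its Appendix A.

    # Hedged sketch of a cubic sparsity ramp between 25% and 75% of training,
    # re-evaluated every 100 steps; assumes s_t = s_final * (1 - (1 - p)^3).
    def cubic_sparsity(step: int, total_steps: int, s_final: float,
                       start_frac: float = 0.25, end_frac: float = 0.75,
                       update_every: int = 100) -> float:
        start, end = start_frac * total_steps, end_frac * total_steps
        step = (step // update_every) * update_every  # hold between updates
        if step <= start:
            return 0.0
        if step >= end:
            return s_final
        progress = (step - start) / (end - start)
        return s_final * (1.0 - (1.0 - progress) ** 3)

    # Example: target sparsity 0.75 halfway through a 10k-step run.
    print(round(cubic_sparsity(step=5_000, total_steps=10_000, s_final=0.75), 3))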