Scaling Laws for Sparsely-Connected Foundation Models
Authors: Elias Frantar, Carlos Riquelme Ruiz, Neil Houlsby, Dan Alistarh, Utku Evci
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the optimal sparsity, i.e., the sparsity level which yields the best performance for a given effective model size and training budget. We execute extensive training sweeps across sparsity, size and data, which we subsequently use to develop scaling laws. Now follows a very brief summary of our main setup; a detailed discussion of all our choices, including the experiment grid and hyperparameters, can be found in Appendix A. (A hedged scaling-law fitting sketch follows this table.) |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 IST Austria; correspondence: elias.frantar@ist.ac.at |
| Pseudocode | Yes | Algorithm 1: Prune weights w to sparsity s = 1 − n/m, where each group of m weights contains at most n non-zeros. (An n:m pruning sketch follows this table.) |
| Open Source Code | Yes | We provide pruning and scaling law fitting code at: github.com/google-research/jaxpruner/tree/main/jaxpruner/projects/bigsparse. |
| Open Datasets | Yes | We use the massive JFT-4B (Google, 2023a) and C4 (Raffel et al., 2020a) datasets, which are several orders of magnitude larger than what has been employed so far by the vast majority of work on sparsity. |
| Dataset Splits | No | The paper mentions using "validation loss" and conducting training sweeps, but it does not specify the explicit splits (e.g., percentages or counts) for the training, validation, or test datasets used in the experiments. |
| Hardware Specification | No | The paper discusses hardware-friendly sparsity patterns and specialized hardware/software for acceleration (e.g., "hardware-friendly n:m patterns", "custom hardware", "newer generations of NVIDIA GPUs"), but it does not specify the actual hardware (e.g., specific GPU or CPU models, memory, or cloud instances) used to run the experiments described in the paper. |
| Software Dependencies | No | The paper mentions using the "Jaxpruner library (Lee et al., 2023)" and "Adafactor-based (Shazeer & Stern, 2018) optimizers". While these are specific software components, the paper does not provide version numbers for these or any other software dependencies, such as the underlying deep learning framework (e.g., JAX, TensorFlow, PyTorch). |
| Experiment Setup | Yes | We execute extensive training sweeps across sparsity, size and data, which we subsequently use to develop scaling laws. Now follows a very brief summary of our main setup; a detailed discussion of all our choices, including the experiment grid and hyperparameters, can be found in Appendix A. In terms of specific hyper-parameters, we prune using a cubic schedule starting after 25% of training and ending at 75%, updating every 100 steps. (A cubic-schedule sketch follows this table.) |
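
Since the Research Type row quotes the paper's plan to develop scaling laws from extensive training sweeps, the following is a hedged illustration of how such a fit can be carried out with nonlinear least squares. The function `sparse_scaling_law`, its parameter names, the grid and every number below are assumptions made for this sketch; they are not the paper's published equation or fitted coefficients, and the authors' own fitting code is in the repository linked in the Open Source Code row.

```python
# Illustrative only: fit a sparsity-aware scaling law to a grid of
# (sparsity, size, data, loss) observations with nonlinear least squares.
import numpy as np
from scipy.optimize import curve_fit

def sparse_scaling_law(X, a_s, b_s, c_s, b_n, a_d, b_d, c):
    """Assumed loss model over sparsity S, non-zero params N, training tokens D."""
    S, N, D = X
    capacity = (a_s * (1.0 - S) ** b_s + c_s) * N ** (-b_n)  # sparsity-modulated size term
    data = a_d * D ** (-b_d)                                 # data term
    return capacity + data + c                               # plus irreducible loss

# Synthetic sweep: sparsity x model size x data budget (normalized units).
grid = [(s, n, d) for s in (0.0, 0.5, 0.75, 0.875)
                  for n in (1.0, 4.0, 16.0)        # e.g. non-zero params / 1e7
                  for d in (1.0, 4.0, 16.0)]       # e.g. training tokens / 1e9
S, N, D = (np.array(x) for x in zip(*grid))
true_params = (2.0, 1.0, 1.0, 0.35, 1.5, 0.3, 0.6)  # made-up ground truth
rng = np.random.default_rng(0)
loss = sparse_scaling_law((S, N, D), *true_params) + 0.01 * rng.normal(size=S.size)

# Recover the coefficients from the (noisy) synthetic sweep.
popt, _ = curve_fit(sparse_scaling_law, (S, N, D), loss,
                    p0=(1.0, 1.0, 0.5, 0.3, 1.0, 0.25, 0.5), maxfev=50_000)
print(dict(zip(("a_s", "b_s", "c_s", "b_n", "a_d", "b_d", "c"), popt)))
```

In practice one would fit to the measured validation losses from the real sweep and then read the optimal sparsity off the fitted surface for a fixed effective model size and training budget.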
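
The n:m pattern in Algorithm 1's caption can be made concrete with a minimal magnitude-based projection: within each consecutive group of m weights, keep the n largest-magnitude entries and zero the rest, which yields sparsity s = 1 − n/m (e.g. 2:4 gives 50%). The sketch below is written against `jax.numpy`; the function name `prune_n_m` and the choice to group along the last axis are assumptions, and it is not a reproduction of the paper's Algorithm 1.

```python
import jax.numpy as jnp

def prune_n_m(w, n: int, m: int):
    """Keep at most n non-zeros (largest magnitudes) in each group of m
    consecutive weights along the last axis; zero out the rest."""
    orig_shape = w.shape
    assert orig_shape[-1] % m == 0, "last dimension must be divisible by m"
    groups = w.reshape(-1, m)                      # (num_groups, m)
    order = jnp.argsort(jnp.abs(groups), axis=-1)  # ascending by magnitude
    ranks = jnp.argsort(order, axis=-1)            # rank of each entry in its group
    mask = ranks >= (m - n)                        # True for the n largest magnitudes
    return (groups * mask).reshape(orig_shape)

# Example: 2:4 sparsity (s = 1 - 2/4 = 0.5) on a toy weight matrix.
w = jnp.array([[0.3, -1.2, 0.05, 0.7],
               [2.0, -0.1, 0.4, -0.6]])
print(prune_n_m(w, n=2, m=4))
```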
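
The Experiment Setup row quotes a cubic pruning schedule that starts after 25% of training, ends at 75%, and updates the mask every 100 steps; this matches the standard gradual-pruning recipe in which sparsity ramps cubically from zero to the target. A minimal sketch follows; the helper name `cubic_sparsity`, its defaults and the clipping behaviour are assumptions, not the paper's implementation.

```python
def cubic_sparsity(step: int, total_steps: int, target_sparsity: float,
                   start_frac: float = 0.25, end_frac: float = 0.75) -> float:
    """Sparsity at `step` under a cubic ramp between start_frac and end_frac."""
    start, end = start_frac * total_steps, end_frac * total_steps
    if step < start:
        return 0.0
    if step >= end:
        return target_sparsity
    progress = (step - start) / (end - start)               # in [0, 1)
    return target_sparsity * (1.0 - (1.0 - progress) ** 3)  # cubic ramp

# Example: target 75% sparsity over 10k steps, mask recomputed every 100 steps.
total_steps, target = 10_000, 0.75
for step in range(0, total_steps + 1, 100):
    s = cubic_sparsity(step, total_steps, target)
    # ...recompute the (magnitude or n:m) pruning mask at sparsity s here...
```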