Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency

Authors: Vithursan Thangarasa, Shreyas Saxena, Abhay Gupta, Sean Lie

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results consistently show DST's superiority over static sparse training. We empirically validate the consistent advantage of DST over static sparse training for Sparse-IFT networks. We show consistent benefits of Sparse-IFT across computer vision and natural language processing domains. We provide an extended set of empirical results in Table 7 to help validate the importance of training with and without non-linearity by training configurations of the Sparse Parallel, Factorized, and Doped IFT families at different levels of sparsity.
Researcher Affiliation | Industry | Cerebras Systems Inc., California, USA. Work done while at Cerebras. Correspondence to: Vithursan Thangarasa <vithu@cerebras.net>.
Pseudocode | No | The paper describes its methods and transformations using textual descriptions and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/CerebrasResearch/Sparse-IFT
Open Datasets | Yes | Our empirical results consistently show DST's superiority over static sparse training. All experiments utilize the ResNet-18 architecture on CIFAR-100 with published settings (DeVries & Taylor, 2017). We apply the best-performing Sparse-IFT transformations (Sparse Wide IFT and Sparse Parallel IFT) from CIFAR-100 to ImageNet using ResNet-18. We follow published training settings for ImageNet (Nvidia, 2023). We evaluate them on 1) object detection on MS COCO 2017 (Lin et al., 2014b), and 2) semantic segmentation on Cityscapes (Cordts et al., 2016). We pre-train the Sparse Wide IFT GPT-3 Small model at s ∈ {50%, 75%} from scratch on the Pile (Gao et al., 2020) dataset using SET (Mocanu et al., 2018).
Dataset Splits | Yes | For object detection, we adopt RetinaNet (Lin et al., 2017b) from the MMDetection open-source toolbox (Chen et al., 2019) and report results in the standardized training setting. This dataset contains 118K training, 5K validation (minival), and 20K test-dev images.
Hardware Specification | Yes | We showcase the practical value of Sparse-IFT with real-world timings for training on the Cerebras CS-2 (Lie, 2023) and inference with Neural Magic DeepSparse (Neural Magic, 2021) using unstructured sparsity. Our setup employs a ResNet-18 model and performs batched inference of 64 images from ImageNet at 224×224 resolution on Intel Cascade Lake CPUs, known for their AVX-512 support. Our experimental setup involves pre-training a GPT-3 model on the CS-2.
Software Dependencies | No | The paper mentions software like 'MMDetection', 'MMSegmentation', 'DeepSparse', and 'LM-eval-harness' but does not provide specific version numbers for any of these software components or other key dependencies like PyTorch or Python.
Experiment Setup | Yes | Our implementation of CIFAR-100 follows the setup from (DeVries & Taylor, 2017) for ResNets. We train the models for 200 epochs with batches of 128 using SGD, Nesterov momentum of 0.9, and weight decay of 5×10⁻⁴. The learning rate is initially set to 0.1 and is scheduled to decrease by a factor of 5 after each of the 60th, 120th, and 160th epochs. For ResNets, we replicate the settings recommended by Nvidia (Nvidia, 2019), which uses the SGD optimizer with a momentum of 0.875 and weight decay of 3.0517578125×10⁻⁵. The model is trained with label smoothing (Szegedy et al., 2016) of 0.1 and mixed precision (Micikevicius et al., 2018) for the standard 90 epochs using a cosine-decay learning rate schedule with an initial learning rate of 0.256 for a batch size of 256. To train all GPT models, we use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.95, and ϵ = 10⁻⁸. The global norm is clipped at 1.0, and a weight decay of 0.1 is used. There is a learning rate warmup over the first 375M tokens, followed by a cosine decay to 10% of the peak learning rate. The context window size is 2048 following (Brown et al., 2020).
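
For concreteness, the optimizer and schedule values quoted in the Experiment Setup row can be written down as a minimal PyTorch sketch. This is an illustrative reconstruction rather than the authors' code: the model objects are placeholders, the epoch loop and data pipeline are omitted, and only the hyperparameters themselves (Nesterov momentum, weight decay, learning-rate milestones, AdamW betas and epsilon, gradient clipping) are taken from the quoted text; the warmup-then-cosine-decay schedule for GPT is noted but not implemented.

```python
# Minimal sketch of the quoted training hyperparameters (illustrative only;
# the Linear modules are placeholders for the paper's actual architectures).
import torch

# --- CIFAR-100 / ResNet-18 setup (DeVries & Taylor, 2017): 200 epochs, batch 128 ---
cifar_model = torch.nn.Linear(8, 8)  # placeholder for the ResNet-18 backbone
cifar_opt = torch.optim.SGD(
    cifar_model.parameters(),
    lr=0.1,               # initial learning rate 0.1
    momentum=0.9,         # Nesterov momentum of 0.9
    nesterov=True,
    weight_decay=5e-4,    # weight decay of 5e-4
)
# Learning rate decreases by a factor of 5 after epochs 60, 120, and 160.
cifar_sched = torch.optim.lr_scheduler.MultiStepLR(
    cifar_opt, milestones=[60, 120, 160], gamma=1 / 5
)

# --- GPT pre-training setup: AdamW with global-norm clipping; the paper also
# uses a warmup over the first 375M tokens followed by cosine decay to 10% of
# the peak learning rate (schedule omitted in this sketch). ---
gpt_model = torch.nn.Linear(8, 8)  # placeholder for the GPT-3 Small model
gpt_opt = torch.optim.AdamW(
    gpt_model.parameters(),
    betas=(0.9, 0.95),    # beta1 = 0.9, beta2 = 0.95
    eps=1e-8,             # epsilon = 1e-8
    weight_decay=0.1,     # weight decay of 0.1
)
# Global gradient-norm clipping at 1.0, applied each step before gpt_opt.step().
torch.nn.utils.clip_grad_norm_(gpt_model.parameters(), max_norm=1.0)
```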