Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency

Authors: Vithursan Thangarasa, Shreyas Saxena, Abhay Gupta, Sean Lie

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results consistently show DST's superiority over static sparse training. We empirically validate the consistent advantage of DST over static sparse training for Sparse-IFT networks. We show consistent benefits of Sparse-IFT across computer vision and natural language processing domains. We provide an extended set of empirical results in Table 7 to help validate the importance of training with and without non-linearity by training configurations of the Sparse Parallel, Factorized, and Doped IFT families at different levels of sparsity.
Researcher Affiliation | Industry | Cerebras Systems Inc., California, USA. Work done while at Cerebras. Correspondence to: Vithursan Thangarasa <vithu@cerebras.net>.
Pseudocode | No | The paper describes its methods and transformations using textual descriptions and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/CerebrasResearch/Sparse-IFT
Open Datasets | Yes | Our empirical results consistently show DST's superiority over static sparse training. All experiments utilize the ResNet-18 architecture on CIFAR-100 with published settings (DeVries & Taylor, 2017). We apply the best-performing Sparse-IFT transformations (Sparse Wide IFT and Sparse Parallel IFT) from CIFAR-100 to ImageNet using ResNet-18. We follow published training settings for ImageNet (Nvidia, 2023). We evaluate them on 1) object detection on MS COCO 2017 (Lin et al., 2014b), and 2) semantic segmentation on Cityscapes (Cordts et al., 2016). We pre-train the Sparse Wide IFT GPT-3 Small model at s ∈ {50%, 75%} from scratch on the Pile (Gao et al., 2020) dataset using SET (Mocanu et al., 2018).
Dataset Splits | Yes | For object detection, we adopt RetinaNet (Lin et al., 2017b) from the MMDetection open-source toolbox (Chen et al., 2019) and report results in the standardized training setting. This dataset contains 118K training, 5K validation (minival), and 20K test-dev images.
Hardware Specification | Yes | We showcase the practical value of Sparse-IFT with real-world timings for training on the Cerebras CS-2 (Lie, 2023) and inference with Neural Magic DeepSparse (Neural Magic, 2021) using unstructured sparsity. Our setup employs a ResNet-18 model and performs batched inference of 64 images from ImageNet at 224×224 resolution on Intel Cascade Lake CPUs, known for their AVX-512 support. Our experimental setup involves pre-training a GPT-3 model on the CS-2.
Software Dependencies | No | The paper mentions software like 'MMDetection', 'MMSegmentation', 'DeepSparse', and 'LM-eval-harness' but does not provide specific version numbers for any of these software components or other key dependencies like PyTorch or Python.
Experiment Setup | Yes | Our implementation of CIFAR-100 follows the setup from (DeVries & Taylor, 2017) for ResNets. We train the models for 200 epochs with batches of 128 using SGD, Nesterov momentum of 0.9, and weight decay of 5×10⁻⁴. The learning rate is initially set to 0.1 and is scheduled to decrease by a factor of 5 after each of the 60th, 120th, and 160th epochs. For ResNets, we replicate the settings recommended by Nvidia (Nvidia, 2019), which uses the SGD optimizer with a momentum of 0.875 and weight decay of 3.0517578125×10⁻⁵. The model is trained with label smoothing (Szegedy et al., 2016) of 0.1 and mixed precision (Micikevicius et al., 2018) for the standard 90 epochs using a cosine-decay learning rate schedule with an initial learning rate of 0.256 for a batch size of 256. To train all GPT models, we use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.95, and ϵ = 10⁻⁸. The global norm is clipped at 1.0, and a weight decay of 0.1 is used. There is a learning rate warmup over the first 375M tokens, followed by a cosine decay to 10% of the peak learning rate. The context window size is 2048 following (Brown et al., 2020).
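
For concreteness, the optimizer and schedule values quoted in the Experiment Setup row can be written down as a minimal PyTorch sketch. This is an illustrative reconstruction rather than the authors' code: the model objects are placeholders, the epoch loop and data pipeline are omitted, and only the hyperparameters themselves (Nesterov momentum, weight decay, learning-rate milestones, AdamW betas and epsilon, gradient clipping) are taken from the quoted text; the warmup-then-cosine-decay schedule for GPT is noted but not implemented.

```python
# Minimal sketch of the quoted training hyperparameters (illustrative only;
# the Linear modules are placeholders for the paper's actual architectures).
import torch

# --- CIFAR-100 / ResNet-18 setup (DeVries & Taylor, 2017): 200 epochs, batch 128 ---
cifar_model = torch.nn.Linear(8, 8)  # placeholder for the ResNet-18 backbone
cifar_opt = torch.optim.SGD(
    cifar_model.parameters(),
    lr=0.1,               # initial learning rate 0.1
    momentum=0.9,         # Nesterov momentum of 0.9
    nesterov=True,
    weight_decay=5e-4,    # weight decay of 5e-4
)
# Learning rate decreases by a factor of 5 after epochs 60, 120, and 160.
cifar_sched = torch.optim.lr_scheduler.MultiStepLR(
    cifar_opt, milestones=[60, 120, 160], gamma=1 / 5
)

# --- GPT pre-training setup: AdamW with global-norm clipping; the paper also
# uses a warmup over the first 375M tokens followed by cosine decay to 10% of
# the peak learning rate (schedule omitted in this sketch). ---
gpt_model = torch.nn.Linear(8, 8)  # placeholder for the GPT-3 Small model
gpt_opt = torch.optim.AdamW(
    gpt_model.parameters(),
    betas=(0.9, 0.95),    # beta1 = 0.9, beta2 = 0.95
    eps=1e-8,             # epsilon = 1e-8
    weight_decay=0.1,     # weight decay of 0.1
)
# Global gradient-norm clipping at 1.0, applied each step before gpt_opt.step().
torch.nn.utils.clip_grad_norm_(gpt_model.parameters(), max_norm=1.0)
```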