STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

Authors: Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios."
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Cornell University; (2) Google; (3) Google DeepMind.
Pseudocode | Yes | Algorithm 1: Proposed STEP Algorithm. (A generic N:M mask sketch follows this table.)
Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | CIFAR-10/100 (Krizhevsky et al., 2009), the GLUE benchmark (Wang et al., 2018), the WMT17 De-En translation task following (Vaswani et al., 2017), and WikiText-2 and WikiText-103 (Merity et al., 2016). (A dataset-loading sketch follows this table.)
Dataset Splits | No | The paper mentions using the "GLUE development set" for BERT fine-tuning, which implies a validation set, but it does not give train/validation/test splits (percentages, sample counts, or citations to predefined splits) for the remaining experiments, such as CIFAR or WikiText.
Hardware Specification | Yes | "All of the experiments run on a Google Cloud TPUv3-8 virtual machine."
Software Dependencies | No | The paper mentions using "deep learning libraries (Paszke et al., 2019; Heek et al., 2020)" (i.e., PyTorch and Flax) but does not provide version numbers for these or any other software components.
Experiment Setup | Yes | "For all the Adam-specific hyperparameters we adopt the default values: {β1 = 0.9, β2 = 0.999, ε = 1e-8}. For the CIFAR tasks, we adopted batch size 128 and tune the learning rate from {1e-4, 5e-5, 1e-5}; for BERT and GPT-2 fine-tuning we follow (Tang et al., 2021) and tune batch size from {8, 16, 32} and learning rate from {1e-4, 5e-5, 1e-5}; for WMT machine translation we follow the exact setup of (Vaswani et al., 2017) and (Kao et al., 2022)." (A training-setup sketch follows this table.)
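To make the N:M structured sparsity setting concrete, here is a minimal sketch of computing an N:M mask (e.g., 2:4) by keeping the largest-magnitude weights in each group. This is a generic illustration, not the paper's Algorithm 1: STEP's contribution is to learn the mask from scratch using an Adam-based precondition, and its exact scoring rule is not reproduced here.

```python
# Generic N:M structured sparsity mask (illustrative only, NOT STEP's Algorithm 1).
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in every consecutive group of m
    along the last dimension; zero out the rest."""
    assert weight.shape[-1] % m == 0, "last dim must be divisible by m"
    groups = weight.reshape(-1, m)              # (num_groups, m)
    topk = groups.abs().topk(n, dim=-1).indices  # top-n positions per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(weight.shape)

w = torch.randn(8, 16)
sparse_w = w * nm_mask(w, n=2, m=4)  # exactly 2 nonzeros per group of 4
```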
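The datasets listed in the table are all publicly available. The paper does not specify how they were obtained; the loaders below (torchvision and the Hugging Face `datasets` library) are assumptions for illustration.

```python
# Hedged sketch: fetching the public datasets named in the table.
import torchvision
from datasets import load_dataset

cifar10 = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")  # WikiText-2
glue_sst2 = load_dataset("glue", "sst2")                   # one GLUE task
```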
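The reported CIFAR setup (Adam with β1 = 0.9, β2 = 0.999, ε = 1e-8, batch size 128, learning rates swept over {1e-4, 5e-5, 1e-5}) translates into a training loop like the sketch below. The model, data loader, and the STEP masking step itself are placeholders; only the optimizer settings and learning-rate grid come from the paper.

```python
# Hedged sketch of the reported CIFAR training configuration.
import torch

def run_cifar_trial(model, train_loader, lr: float, epochs: int = 1):
    # Adam with the default hyperparameters quoted in the paper.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:  # batch size 128 in the paper
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Learning-rate grid reported for the CIFAR tasks.
lr_grid = [1e-4, 5e-5, 1e-5]
```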