STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Authors: Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios. |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, Cornell University; ²Google; ³Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Proposed STEP Algorithm (an illustrative N:M mask sketch appears after this table) |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | CIFAR10/100 dataset (Krizhevsky et al., 2009), GLUE benchmark (Wang et al., 2018), WMT17 De-En Translation task following (Vaswani et al., 2017), Wikitext-2 and Wikitext-103 (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions using the "GLUE development set" for BERT fine-tuning, which implies a validation set, but it does not give train/validation/test split details (percentages, sample counts, or citations to predefined splits) for the other experiments, such as CIFAR or Wikitext. |
| Hardware Specification | Yes | All of the experiments run on a Google Cloud TPUv3-8 virtual machine. |
| Software Dependencies | No | The paper mentions using "deep learning libraries (Paszke et al., 2019; Heek et al., 2020)" (referring to PyTorch and Flax) but does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | For all the Adam-specific hyperparameters we adopt the default values: {β1 = 0.9, β2 = 0.999, ϵ = 1e-8}. For the CIFAR tasks, we adopted batch size 128 and tune the learning rate from {1e-4, 5e-5, 1e-5}; for BERT and GPT-2 fine-tuning we follow (Tang et al., 2021) and tune batch size from {8, 16, 32} and learning rate from {1e-4, 5e-5, 1e-5}; for WMT machine translation we follow the exact setup of (Vaswani et al., 2017) and (Kao et al., 2022). (A minimal optimizer sketch follows the table.) |
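
For readers who want a concrete picture of the N:M constraint referenced in the Pseudocode row, below is a minimal PyTorch sketch of a magnitude-based 2:4 mask. It is an illustration only, not the paper's Algorithm 1: STEP additionally preconditions mask learning with Adam's second-moment estimate, which this sketch omits, and the helper name `nm_mask` is a hypothetical choice.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude entries in every block of m weights."""
    assert weight.numel() % m == 0, "weight size must be divisible by m"
    blocks = weight.detach().abs().reshape(-1, m)   # group flattened weights into blocks of m
    topk_idx = blocks.topk(n, dim=1).indices        # positions of the n largest magnitudes per block
    mask = torch.zeros_like(blocks)
    mask.scatter_(1, topk_idx, 1.0)                 # 1 = keep, 0 = prune
    return mask.reshape(weight.shape)

# Example: impose a 2:4 pattern on a linear layer's weight matrix.
layer = torch.nn.Linear(16, 8)
with torch.no_grad():
    layer.weight.mul_(nm_mask(layer.weight, n=2, m=4))
```

Zeroing the weights in place here is only meant to show the 2-out-of-4 block structure; training-time recipes such as SR-STE and STEP keep the dense weights and re-derive the mask as optimization proceeds.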
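
The Experiment Setup row maps directly onto a standard Adam configuration. The sketch below shows, under the assumption of a PyTorch training loop (the placeholder model and sweep loop are not from the paper), how the reported defaults and learning-rate grid could be wired up.

```python
import torch

model = torch.nn.Linear(16, 8)             # hypothetical placeholder model

for lr in (1e-4, 5e-5, 1e-5):              # learning-rate grid quoted for the CIFAR tasks
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.999),                # β1, β2 defaults reported in the paper
        eps=1e-8,                          # ϵ default reported in the paper
    )
    # ... train with batch size 128 (CIFAR) and keep the best run on the validation set ...
```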