Sparsity May Cry: Let Us Fail (Current) Sparse Neural Networks Together!
Authors: Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Kumar Jaiswal, Zhangyang Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our systematic evaluation of the most representative sparse algorithms reveals an important obscured observation: the state-of-the-art magnitude- and/or gradient-based sparse algorithms seemingly fail to perform on SMC-Bench when applied out-of-the-box, sometimes at trivial sparsity levels as low as 5%. We further conduct a thorough investigation into the reasons for the failure of common SNNs. (A minimal magnitude-pruning sketch appears after this table.) |
| Researcher Affiliation | Academia | Shiwei Liu¹, Tianlong Chen¹, Zhenyu Zhang¹, Xuxi Chen¹, Tianjin Huang², Ajay Jaiswal¹, Zhangyang Wang¹. ¹University of Texas at Austin; ²Eindhoven University of Technology |
| Pseudocode | No | The paper describes the various sparse neural network approaches and pruning algorithms verbally, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source SMC-Bench to assist researchers in building next-generation sparse algorithms that scale and generalize: https://github.com/VITA-Group/SMC-Bench. |
| Open Datasets | Yes | We consider three commonly used datasets for commonsense reasoning. (1) RACE (Lai et al., 2017) ... (2) WinoGrande (Sakaguchi et al., 2021) ... (3) CommonsenseQA (CSQA) (Talmor et al., 2018)... (1) the widely used MAWPS benchmark (Koncel-Kedziorski et al., 2016)... (2) the arithmetic subset of ASDiv (Miao et al., 2021)... (3) the more challenging SVAMP (Patel et al., 2021) dataset... (1) HotProtein (Chen et al., 2023)... (2) Meltome Atlas (Jarzab et al., 2020)... follow Liu et al. (2020); Tang et al. (2020) and choose 10 English-centric language pairs (Fr, Cs, De, Gu, Ja, My, Ro, Ru, Vi, Zh ↔ En) from an open-source parallel corpus, OPUS (OPU, 2020). |
| Dataset Splits | Yes | On MAWPS and ASDiv-A, models are trained with the training data and then evaluated with 5-fold cross-validation based on pre-assigned splits. |
| Hardware Specification | Yes | All models are trained with the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 1 × 10⁻⁵ using an A100 GPU. |
| Software Dependencies | No | The paper mentions using a 'sequence modeling toolkit Fairseq' and the 'Adam optimizer', but it does not specify version numbers for these or any other software dependencies, such as programming languages or libraries. |
| Experiment Setup | Yes | We follow the training settings of the sequence modeling toolkit Fairseq (Ott et al., 2019) and fine-tune the pre-trained RoBERTa on our datasets with a standard cross-entropy loss. All models are trained with the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 1 × 10⁻⁵ using an A100 GPU. For CSQA, we train the model for 3000 steps with a linear warmup of 150 steps and a batch size of 16. The dropout rate is set to 0.1. (Appendix A provides detailed hyperparameters for all models and datasets.) (A runnable sketch of this setup follows the table.) |
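The magnitude-based pruning family quoted in the Research Type row can be summarized in a few lines. The sketch below is a generic one-shot, global magnitude-pruning routine, not the paper's exact implementation; the `magnitude_prune` helper and the toy `net` are illustrative assumptions.

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude fraction `sparsity` of weights
    across all Linear layers (one-shot, global magnitude pruning)."""
    weights = [m.weight for m in model.modules()
               if isinstance(m, torch.nn.Linear)]
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * scores.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(scores, k).values
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())

# Even a "trivial" 5% sparsity is enough to expose failures on some
# SMC-Bench tasks, according to the paper.
net = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(),
                          torch.nn.Linear(8, 2))
magnitude_prune(net, sparsity=0.05)
```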
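The Experiment Setup row pins down enough hyperparameters for a minimal reproduction sketch: Adam, learning rate 1 × 10⁻⁵, 3000 steps, 150 warmup steps, batch size 16, dropout 0.1. The version below swaps the paper's Fairseq pipeline for Hugging Face Transformers for brevity, and substitutes a random dummy batch for tokenized CommonsenseQA inputs; both substitutions, along with the tensor shapes, are assumptions.

```python
import torch
from transformers import RobertaForMultipleChoice, get_linear_schedule_with_warmup

# RoBERTa with the reported dropout rate of 0.1.
model = RobertaForMultipleChoice.from_pretrained(
    "roberta-large", hidden_dropout_prob=0.1
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=150, num_training_steps=3000
)

# Dummy stand-in for a tokenized CSQA batch: 16 questions with 5 answer
# choices each, sequence length 64 (shapes are illustrative).
batch = {
    "input_ids": torch.randint(0, 50265, (16, 5, 64)),
    "attention_mask": torch.ones(16, 5, 64, dtype=torch.long),
    "labels": torch.randint(0, 5, (16,)),
}

for step in range(3000):
    loss = model(**batch).loss  # standard cross-entropy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In a real run, the batch would come from a DataLoader over CSQA with batch size 16, and the model and tensors would be moved to the A100 GPU mentioned in the Hardware Specification row.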