Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
Authors: Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, Hongsheng Li
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. We conduct extensive experiments on various tasks with N:M fine-grained sparse nets, and provide benchmarks for N:M sparse net training to facilitate co-development of related software and hardware design. |
| Researcher Affiliation | Collaboration | Aojun Zhou (1,2), Yukun Ma (3), Junnan Zhu (4), Jianbo Liu (2), Zhijie Zhang (1), Kun Yuan (1), Wenxiu Sun (1), Hongsheng Li (2); 1: SenseTime Research, 2: CUHK-SenseTime Joint Lab, CUHK, 3: Northwestern University, 4: NLPR, CASIA |
| Pseudocode | Yes | Algorithm 1: Training N:M Sparse Neural Networks from Scratch with SR-STE (a hedged sketch of this training step is given after the table) |
| Open Source Code | Yes | Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity. |
| Open Datasets | Yes | ImageNet-1K (Deng et al., 2009) is a large-scale classification task...All experiments are performed on the challenging MS COCO 2017 dataset (Lin et al., 2014)...The optical flow prediction is conducted on the Flying Chairs (Dosovitskiy et al., 2015) dataset...For English-German translation, the training set consists of about 4.5 million bilingual sentence pairs from WMT 2014. |
| Dataset Splits | Yes | The ImageNet-1K dataset has about 1.2 million training images and 50 thousand validation images...We train models on the training dataset train-2017, and evaluate models on the validation dataset val-2017...The training dataset contains 22,232 samples and the validation dataset contains 640 test samples...We use newstest2013 as the validation set and newstest2014 as the test set. |
| Hardware Specification | Yes | Specifically, a 2:4 sparse network could achieve 2× speed-up without performance drop on Nvidia A100 GPUs. We employ four Titan XP GPUs to train both the baseline and our model. |
| Software Dependencies | No | The paper mentions software frameworks and optimizers like MMDetection, RAFT, and Adam optimizer, but it does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All ImageNet-1K experiments are trained on images of 224×224, with the dense model baselines following the hyperparameter setting in (He et al., 2019). Specifically, all models are trained with a batch size of 256 over 120 epochs; learning rates are annealed from 0.1 to 0 with a cosine scheduler, and during the first 5 epochs the learning rate linearly increases from 0 to 0.1 (see the schedule sketch below). Each mini-batch on one GPU contains a set of sentence pairs with roughly 4,096 source and 4,096 target tokens. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9 and β2 = 0.98. For our model, we train for 300,000 steps. |
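
For reference, here is a minimal PyTorch sketch of the training step that the quoted Algorithm 1 describes: keep the N largest-magnitude weights in every group of M for the forward pass, and in the backward pass add the sparse-refined decay term on the pruned weights to the straight-through gradient, as in the paper's update W ← W − γ(g(W̃) + λ_W(Ē ⊙ W)). The names `nm_mask` and `SRSTE`, the grouping along the flattened last dimension, and the argument layout are our assumptions for illustration, not code taken from the NM-sparsity repository.

```python
import torch


def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude weights in every group of m.

    Assumes weight.numel() is divisible by m and that grouping along the
    flattened last dimension matches the intended N:M pattern.
    """
    w = weight.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    _, drop_idx = torch.topk(w.abs(), m - n, dim=1, largest=False)
    mask = torch.ones_like(w)
    mask.scatter_(1, drop_idx, 0.0)
    return mask.reshape(weight.shape)


class SRSTE(torch.autograd.Function):
    """Sparse-refined straight-through estimator (hedged sketch).

    Forward: use the N:M-pruned weight W̃ = W ⊙ mask.
    Backward: pass the gradient w.r.t. W̃ straight through to W and add
    lambda_w * (1 - mask) * W, i.e. a decay term on the pruned weights only.
    """

    @staticmethod
    def forward(ctx, weight, n, m, lambda_w):
        mask = nm_mask(weight, n, m)
        ctx.save_for_backward(weight, mask)
        ctx.lambda_w = lambda_w
        return weight * mask

    @staticmethod
    def backward(ctx, grad_output):
        weight, mask = ctx.saved_tensors
        grad_weight = grad_output + ctx.lambda_w * (1.0 - mask) * weight
        return grad_weight, None, None, None
```

A layer would call this on its dense weight before the usual op, e.g. `w_sparse = SRSTE.apply(self.weight, 2, 4, lambda_w)` followed by `torch.nn.functional.linear(x, w_sparse, self.bias)`; `lambda_w` is the sparse-refined regularization strength, a small value tuned in the paper's experiments.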
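
The ImageNet-1K schedule quoted in the Experiment Setup row (cosine annealing from 0.1 to 0 over 120 epochs, with a 5-epoch linear warmup from 0 to 0.1) can be written as a small helper. Whether the cosine phase spans all 120 epochs or only the epochs after warmup is not spelled out, so the version below is one plausible reading rather than the authors' exact implementation.

```python
import math


def learning_rate(epoch: int, step: int, steps_per_epoch: int,
                  base_lr: float = 0.1, warmup_epochs: int = 5,
                  total_epochs: int = 120) -> float:
    """Cosine schedule from base_lr to 0 with an initial linear warmup."""
    t = epoch + step / steps_per_epoch  # fractional epoch
    if t < warmup_epochs:
        # Linear warmup: 0 -> base_lr over the first warmup_epochs.
        return base_lr * t / warmup_epochs
    # Cosine annealing over the remaining epochs (assumption: post-warmup window).
    progress = (t - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```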