Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
Authors: Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, Hongsheng Li
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. We conduct extensive experiments on various tasks with N:M fine-grained sparse nets, and provide benchmarks for N:M sparse net training to facilitate co-development of related software and hardware design. |
| Researcher Affiliation | Collaboration | Aojun Zhou (1,2), Yukun Ma (3), Junnan Zhu (4), Jianbo Liu (2), Zhijie Zhang (1), Kun Yuan (1), Wenxiu Sun (1), Hongsheng Li (2); 1: SenseTime Research, 2: CUHK-SenseTime Joint Lab, CUHK, 3: Northwestern University, 4: NLPR, CASIA |
| Pseudocode | Yes | Algorithm 1: Training N:M Sparse Neural Networks from Scratch with SR-STE (a hedged sketch of this training step is given after the table) |
| Open Source Code | Yes | Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity. |
| Open Datasets | Yes | ImageNet-1K (Deng et al., 2009) is a large-scale classification task...All experiments are performed on the challenging MS COCO 2017 dataset (Lin et al., 2014)...The optical flow prediction is conducted on the Flying Chairs (Dosovitskiy et al., 2015) dataset...For English-German translation, the training set consists of about 4.5 million bilingual sentence pairs from WMT 2014. |
| Dataset Splits | Yes | The ImageNet-1K dataset has about 1.2 million training images and 50 thousand validation images...We train models on the training dataset train-2017, and evaluate models on the validation dataset val-2017...The training dataset contains 22,232 samples and the validation dataset contains 640 test samples...We use newstest2013 as the validation set and newstest2014 as the test set. |
| Hardware Specification | Yes | Specifically, a 2:4 sparse network could achieve 2× speed-up without performance drop on Nvidia A100 GPUs. We employ four Titan XP GPUs to train both the baseline and our model. |
| Software Dependencies | No | The paper mentions software frameworks and optimizers like MMDetection, RAFT, and Adam optimizer, but it does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All ImageNet-1K experiments are trained on images of 224×224, with the dense model baselines following the hyperparameter setting in (He et al., 2019). Specifically, all models are trained with a batch size of 256 over 120 epochs; learning rates are annealed from 0.1 to 0 with a cosine scheduler, and during the first 5 epochs the learning rate linearly increases from 0 to 0.1 (see the schedule sketch below). Each mini-batch on one GPU contains a set of sentence pairs with roughly 4,096 source and 4,096 target tokens. We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9 and β2 = 0.98. For our model, we train for 300,000 steps. |
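
For reference, here is a minimal PyTorch sketch of the training step that the quoted Algorithm 1 describes: keep the N largest-magnitude weights in every group of M for the forward pass, and in the backward pass add the sparse-refined decay term on the pruned weights to the straight-through gradient, as in the paper's update W ← W − γ(g(W̃) + λ_W(Ē ⊙ W)). The names `nm_mask` and `SRSTE`, the grouping along the flattened last dimension, and the argument layout are our assumptions for illustration, not code taken from the NM-sparsity repository.

```python
import torch


def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude weights in every group of m.

    Assumes weight.numel() is divisible by m and that grouping along the
    flattened last dimension matches the intended N:M pattern.
    """
    w = weight.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    _, drop_idx = torch.topk(w.abs(), m - n, dim=1, largest=False)
    mask = torch.ones_like(w)
    mask.scatter_(1, drop_idx, 0.0)
    return mask.reshape(weight.shape)


class SRSTE(torch.autograd.Function):
    """Sparse-refined straight-through estimator (hedged sketch).

    Forward: use the N:M-pruned weight W̃ = W ⊙ mask.
    Backward: pass the gradient w.r.t. W̃ straight through to W and add
    lambda_w * (1 - mask) * W, i.e. a decay term on the pruned weights only.
    """

    @staticmethod
    def forward(ctx, weight, n, m, lambda_w):
        mask = nm_mask(weight, n, m)
        ctx.save_for_backward(weight, mask)
        ctx.lambda_w = lambda_w
        return weight * mask

    @staticmethod
    def backward(ctx, grad_output):
        weight, mask = ctx.saved_tensors
        grad_weight = grad_output + ctx.lambda_w * (1.0 - mask) * weight
        return grad_weight, None, None, None
```

A layer would call this on its dense weight before the usual op, e.g. `w_sparse = SRSTE.apply(self.weight, 2, 4, lambda_w)` followed by `torch.nn.functional.linear(x, w_sparse, self.bias)`; `lambda_w` is the sparse-refined regularization strength, a small value tuned in the paper's experiments.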
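
The ImageNet-1K schedule quoted in the Experiment Setup row (cosine annealing from 0.1 to 0 over 120 epochs, with a 5-epoch linear warmup from 0 to 0.1) can be written as a small helper. Whether the cosine phase spans all 120 epochs or only the epochs after warmup is not spelled out, so the version below is one plausible reading rather than the authors' exact implementation.

```python
import math


def learning_rate(epoch: int, step: int, steps_per_epoch: int,
                  base_lr: float = 0.1, warmup_epochs: int = 5,
                  total_epochs: int = 120) -> float:
    """Cosine schedule from base_lr to 0 with an initial linear warmup."""
    t = epoch + step / steps_per_epoch  # fractional epoch
    if t < warmup_epochs:
        # Linear warmup: 0 -> base_lr over the first warmup_epochs.
        return base_lr * t / warmup_epochs
    # Cosine annealing over the remaining epochs (assumption: post-warmup window).
    progress = (t - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```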