Bi-directional Masks for Efficient N:M Sparse Training

Authors: Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, Rongrong Ji

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on representative benchmarks for image classification. For small-scale dataset, we choose the CIFAR-10 dataset (Krizhevsky et al., 2009)... For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009)...
Researcher Affiliation | Collaboration | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; (2) Tencent Youtu Lab, Shanghai, China; (3) Institute of Artificial Intelligence, Xiamen University, Xiamen, China; (4) Pengcheng Lab, Shenzhen, China.
Pseudocode | Yes | Algorithm 1: Bi-Mask for Efficient N:M Sparse Training. (An illustrative N:M masking sketch follows the table.)
Open Source Code | Yes | Project of this paper is available at https://github.com/zyxxmu/Bi-Mask.
Open Datasets | Yes | For small-scale dataset, we choose the CIFAR-10 dataset (Krizhevsky et al., 2009)... For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009)...
Dataset Splits | Yes | For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009), which contains over 1.2 million images for training and 50,000 validation images in 1,000 categories.
Hardware Specification | Yes | All experiments are conducted on the NVIDIA Tesla A100 GPUs.
Software Dependencies | No | Our implementation of Bi-Mask is based on the PyTorch framework (Paszke et al., 2019). ... For DeiT-small, we follow (Zhang et al., 2022) to train for 300 epochs in total using the timm framework (Wightman, 2019). (No version numbers are provided for PyTorch or the timm framework; a version-logging snippet follows the table.)
Experiment Setup | Yes | The training iteration interval T is set to 100 and the number of permutation candidates K is set to 100. We use the stochastic gradient descent (SGD) optimizer to perform sparse training. In the first 5 training epochs, the learning rate linearly increases from 0 to 0.1 and then is decayed using the cosine annealing (Loshchilov & Hutter, 2017). The momentum and batch size are respectively set to 0.9 and 256. On CIFAR-10, we train all networks for 300 epochs with a weight decay of 1×10⁻³. On ImageNet, we follow (Zhou et al., 2021) to train ResNet-18/50 for a total of 120 epochs. (An optimizer and schedule sketch follows the table.)
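The Pseudocode row points to the paper's Algorithm 1, which is not reproduced here. As a minimal, illustrative sketch of the N:M sparsity operation such training builds on (not the paper's Bi-Mask algorithm itself), the following hypothetical PyTorch helper keeps the N largest-magnitude weights in every consecutive group of M along the input dimension:

    import torch

    def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
        """Binary mask keeping the n largest-magnitude entries in every
        consecutive group of m weights along the last dimension.
        Illustrative helper only; not the paper's Bi-Mask algorithm."""
        out_features, in_features = weight.shape
        assert in_features % m == 0, "input dimension must be divisible by m"
        groups = weight.detach().abs().reshape(out_features, in_features // m, m)
        topk = groups.topk(n, dim=-1).indices   # positions of the n survivors in each group
        mask = torch.zeros_like(groups)
        mask.scatter_(-1, topk, 1.0)            # set surviving positions to 1
        return mask.reshape(out_features, in_features)

    # Example: apply a 2:4 mask to a small linear layer's weight
    w = torch.randn(8, 16)
    w_sparse = w * nm_mask(w, n=2, m=4)

The paper's contribution concerns how forward and backward masks are obtained and permuted during training; this snippet only fixes notation for the plain N:M pattern.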
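Since the Software Dependencies row notes that no version numbers are given for PyTorch or timm, anyone re-running the setup may want to record the versions actually used; a small snippet, assuming both packages are installed locally:

    import torch, timm

    # Record the exact framework versions of the local environment,
    # since the paper itself does not pin them.
    print("torch:", torch.__version__)
    print("timm:", timm.__version__)
    print("CUDA runtime:", torch.version.cuda)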
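The Experiment Setup row quotes the CIFAR-10 recipe: SGD with momentum 0.9, batch size 256, a 5-epoch linear warmup to 0.1 followed by cosine annealing, weight decay 1×10⁻³, and 300 epochs. A minimal PyTorch sketch of that schedule, with a placeholder model standing in for the actual sparse network, might look as follows; whether the warmup is applied per iteration or per epoch is not specified in the quoted text, so the per-epoch form below is an assumption:

    import torch
    from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

    # Placeholder model; only the hyperparameters below come from the quoted setup.
    model = torch.nn.Linear(10, 10)
    epochs, warmup_epochs, base_lr = 300, 5, 0.1

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-3)

    # Linear warmup from (near) 0 to 0.1 over the first 5 epochs, then cosine annealing.
    warmup = LinearLR(optimizer, start_factor=1e-8, end_factor=1.0,
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_epochs])

    for epoch in range(epochs):
        # ... one training epoch over CIFAR-10 with batch size 256 would run here ...
        scheduler.step()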