Bi-directional Masks for Efficient N:M Sparse Training

Authors: Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, Rongrong Ji

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on representative benchmarks for image classification. For small-scale dataset, we choose the CIFAR-10 dataset (Krizhevsky et al., 2009)... For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009)...
Researcher Affiliation | Collaboration | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; (2) Tencent Youtu Lab, Shanghai, China; (3) Institute of Artificial Intelligence, Xiamen University, Xiamen, China; (4) Pengcheng Lab, Shenzhen, China.
Pseudocode | Yes | Algorithm 1: Bi-Mask for Efficient N:M Sparse Training. (An illustrative N:M masking sketch follows the table.)
Open Source Code | Yes | Project of this paper is available at https://github.com/zyxxmu/Bi-Mask.
Open Datasets | Yes | For small-scale dataset, we choose the CIFAR-10 dataset (Krizhevsky et al., 2009)... For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009)...
Dataset Splits | Yes | For large-scale dataset, we choose the challenging ImageNet (Deng et al., 2009), which contains over 1.2 million images for training and 50,000 validation images in 1,000 categories.
Hardware Specification | Yes | All experiments are conducted on the NVIDIA Tesla A100 GPUs.
Software Dependencies | No | Our implementation of Bi-Mask is based on the PyTorch framework (Paszke et al., 2019). ... For DeiT-small, we follow (Zhang et al., 2022) to train for 300 epochs in total using the timm framework (Wightman, 2019). (No version numbers are provided for PyTorch or the timm framework; a version-logging snippet follows the table.)
Experiment Setup | Yes | The training iteration interval T is set to 100 and the number of permutation candidates K is set to 100. We use the stochastic gradient descent (SGD) optimizer to perform sparse training. In the first 5 training epochs, the learning rate linearly increases from 0 to 0.1 and then is decayed using the cosine annealing (Loshchilov & Hutter, 2017). The momentum and batch size are respectively set to 0.9 and 256. On CIFAR-10, we train all networks for 300 epochs with a weight decay of 1×10⁻³. On ImageNet, we follow (Zhou et al., 2021) to train ResNet-18/50 for a total of 120 epochs. (An optimizer and schedule sketch follows the table.)
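The Pseudocode row points to the paper's Algorithm 1, which is not reproduced here. As a minimal, illustrative sketch of the N:M sparsity operation such training builds on (not the paper's Bi-Mask algorithm itself), the following hypothetical PyTorch helper keeps the N largest-magnitude weights in every consecutive group of M along the input dimension:

    import torch

    def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
        """Binary mask keeping the n largest-magnitude entries in every
        consecutive group of m weights along the last dimension.
        Illustrative helper only; not the paper's Bi-Mask algorithm."""
        out_features, in_features = weight.shape
        assert in_features % m == 0, "input dimension must be divisible by m"
        groups = weight.detach().abs().reshape(out_features, in_features // m, m)
        topk = groups.topk(n, dim=-1).indices   # positions of the n survivors in each group
        mask = torch.zeros_like(groups)
        mask.scatter_(-1, topk, 1.0)            # set surviving positions to 1
        return mask.reshape(out_features, in_features)

    # Example: apply a 2:4 mask to a small linear layer's weight
    w = torch.randn(8, 16)
    w_sparse = w * nm_mask(w, n=2, m=4)

The paper's contribution concerns how forward and backward masks are obtained and permuted during training; this snippet only fixes notation for the plain N:M pattern.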
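Since the Software Dependencies row notes that no version numbers are given for PyTorch or timm, anyone re-running the setup may want to record the versions actually used; a small snippet, assuming both packages are installed locally:

    import torch, timm

    # Record the exact framework versions of the local environment,
    # since the paper itself does not pin them.
    print("torch:", torch.__version__)
    print("timm:", timm.__version__)
    print("CUDA runtime:", torch.version.cuda)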
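The Experiment Setup row quotes the CIFAR-10 recipe: SGD with momentum 0.9, batch size 256, a 5-epoch linear warmup to 0.1 followed by cosine annealing, weight decay 1×10⁻³, and 300 epochs. A minimal PyTorch sketch of that schedule, with a placeholder model standing in for the actual sparse network, might look as follows; whether the warmup is applied per iteration or per epoch is not specified in the quoted text, so the per-epoch form below is an assumption:

    import torch
    from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

    # Placeholder model; only the hyperparameters below come from the quoted setup.
    model = torch.nn.Linear(10, 10)
    epochs, warmup_epochs, base_lr = 300, 5, 0.1

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-3)

    # Linear warmup from (near) 0 to 0.1 over the first 5 epochs, then cosine annealing.
    warmup = LinearLR(optimizer, start_factor=1e-8, end_factor=1.0,
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_epochs])

    for epoch in range(epochs):
        # ... one training epoch over CIFAR-10 with batch size 256 would run here ...
        scheduler.step()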