Advancing Dynamic Sparse Training by Exploring Optimization Opportunities

Authors: Jie Ji, Gen Li, Lu Yin, Minghai Qin, Geng Yuan, Linke Guo, Shiwei Liu, Xiaolong Ma

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we carry out experiments to comprehensively demonstrate the advantages of BiDST. We evaluate BiDST in comparison with the state-of-the-art (SOTA) DST methods, and show superior accuracy, effective mask searching ability, as well as great applicability for implementations. We follow the traditional network and dataset selection used in prior DST methods. We use ResNet-32 (Zagoruyko & Komodakis, 2016), VGG-19 (Simonyan & Zisserman, 2014) and MobileNet-v2 (Sandler et al., 2018) on CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), and we use ResNet-34 and ResNet-50 (He et al., 2016) on ImageNet-1K dataset (Deng et al., 2009).
Researcher Affiliation | Academia | 1 Clemson University, USA; 2 University of Aberdeen, Scotland; 3 University of Georgia, USA; 4 University of Oxford, England.
Pseudocode | Yes | Algorithm 1 (BiDST implementation details).
  Input: a DNN model with randomly initialized weights θ_0; a random mask ψ_0 with sparsity s; a flag parameter F indicating when to change the subnetwork topology based on ψ.
  Output: a sparse model satisfying the target sparsity s.
  Set t = 0. Set the number of non-zero weights to k = s·|θ|.
  while t < T do
    ψ̂_t ← Binarize(ψ_t, argmax(ψ_t, k))
    Train the subnetwork f(ψ̂_t ⊙ θ_t) by solving Eq. 10.
    Update mask ψ by solving Eq. 11.
    t = t + 1.
  (A hedged PyTorch sketch of this loop follows the table.)
Open Source Code | Yes | Code available at https://github.com/jjsrf/BiDST-ICML2024.
Open Datasets | Yes | We use ResNet-32 (Zagoruyko & Komodakis, 2016), VGG-19 (Simonyan & Zisserman, 2014) and MobileNet-v2 (Sandler et al., 2018) on CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), and we use ResNet-34 and ResNet-50 (He et al., 2016) on ImageNet-1K dataset (Deng et al., 2009).
Dataset Splits | No | The paper uses standard datasets (CIFAR-10/100, ImageNet-1K) with predefined splits and mentions "standard data augmentation" and a "standard training recipe". However, it does not state split percentages or sample counts for training, validation, and test sets, nor how any validation split was constructed, beyond implying standard usage.
Hardware Specification | Yes | We test the on-device training performance using a Samsung Galaxy S21 with Snapdragon 888 chipset. For DST, the on-device computation is done by static code (e.g., OpenCL for GPU and C++ for CPU), and the training acceleration is obtained from compiler optimization that skips the zeros in weights. ... implemented on Snapdragon 888. (See the zero-skipping sketch after the table.)
Software Dependencies | No | We extend the code generation of TVM (Chen et al., 2018) and design a training engine on the Snapdragon 888. For DST, the on-device computation is done by static code (e.g., OpenCL for GPU and C++ for CPU)... The paper names these tools but does not give specific version numbers.
Experiment Setup | Yes | We use a standard training recipe following Yuan et al. (2021) and Wang et al. (2020b). To ensure fair comparison, all BiDST experiments have a slight scale-down in the number of training epochs to compensate for the mask-learning computation cost. We use standard data augmentation, and a cosine annealing learning rate schedule is used with the SGD optimizer. For CIFAR-10/100, we use a batch size of 64 and set the initial learning rate to 0.1. For ImageNet-1K, we use a batch size of 1024 and a learning rate of 1.024 with a linear warm-up for 5 epochs. Due to limited space, we put detailed settings in Appendix A. Table A.1 (hyperparameter settings, excerpt): training epochs 158; batch size 64; initial learning rate 0.1; mask update frequency 8; regularization coefficient λ = 1e-4. (A hedged reconstruction of the CIFAR recipe follows the table.)
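
The Algorithm 1 row above alternates between training the binarized subnetwork and updating the soft mask. Below is a minimal, hypothetical PyTorch sketch of that loop, assuming a single linear layer for brevity; `MaskedLinear`, `binarize_topk`, the straight-through estimator, and the alternating optimizer steps are illustrative stand-ins for "solving Eq. 10 / Eq. 11" (the equations are not reproduced in this excerpt), and none of these names come from the BiDST repository.

```python
# Hypothetical sketch of the Algorithm 1 loop; names are illustrative, not from the BiDST code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def binarize_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Return a {0,1} mask that keeps the k largest-scoring positions."""
    mask = torch.zeros_like(scores)
    idx = torch.topk(scores.flatten(), k).indices
    mask.view(-1)[idx] = 1.0
    return mask


class MaskedLinear(nn.Module):
    """A linear layer whose weights are gated by a learnable soft mask (psi)."""

    def __init__(self, in_features: int, out_features: int, sparsity: float):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.scores = nn.Parameter(torch.rand(out_features, in_features))  # soft mask psi
        self.k = max(1, int((1 - sparsity) * self.weight.numel()))         # non-zeros kept

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hard = binarize_topk(self.scores, self.k)   # Binarize(psi_t, argmax(psi_t, k))
        # Straight-through estimator: the forward pass uses the hard mask,
        # the backward pass routes gradients to the soft scores.
        mask = hard + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)


layer = MaskedLinear(32, 10, sparsity=0.9)
opt_w = torch.optim.SGD([layer.weight], lr=0.1, momentum=0.9)   # weight step, stand-in for Eq. 10
opt_m = torch.optim.SGD([layer.scores], lr=0.01)                # mask step, stand-in for Eq. 11
for t in range(100):                                            # while t < T
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))     # toy data
    loss = F.cross_entropy(layer(x), y)
    opt_w.zero_grad(); opt_m.zero_grad()
    loss.backward()
    opt_w.step()
    if t % 8 == 0:   # mask update frequency 8, as in the Table A.1 excerpt
        opt_m.step()
```

The point of the sketch is that a single backward pass yields gradients for both the weights and the mask scores, so the topology can be refreshed at a fixed frequency without a separate prune-and-regrow pass; how BiDST actually couples the two updates is defined by Eqs. 10-11 in the paper, not by this code.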
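
The hardware row attributes the on-device speedup to static code that skips zero weights. The paper's engine emits OpenCL/C++ through TVM; the NumPy sketch below only illustrates the underlying idea of a compressed (CSR-style) weight layout in which just the non-zero entries are stored and traversed. The function names here are assumptions for illustration, not part of the paper's engine.

```python
# Illustrative only: a CSR-style mat-vec that touches only non-zero weights.
import numpy as np


def to_csr(dense: np.ndarray):
    """Compress a dense weight matrix so only non-zero entries are stored."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.asarray(values), np.asarray(col_idx, dtype=np.int64), np.asarray(row_ptr)


def csr_matvec(values, col_idx, row_ptr, x):
    """y = W @ x, visiting only the stored non-zero weights of each row."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
    return y


W = np.random.randn(8, 16) * (np.random.rand(8, 16) > 0.9)  # ~90% of weights are zero
x = np.random.randn(16)
vals, cols, ptr = to_csr(W)
assert np.allclose(csr_matvec(vals, cols, ptr, x), W @ x)
```

Because DST keeps the sparsity pattern fixed between mask updates, a compiler can specialize loops like `csr_matvec` ahead of time, which is the kind of zero-skipping optimization the excerpt credits for the Snapdragon 888 acceleration.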
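
Finally, the experiment-setup row pins down most of the CIFAR recipe (SGD, batch size 64, initial learning rate 0.1, cosine annealing, 158 epochs). The sketch below is a hedged reconstruction of that recipe in PyTorch/torchvision: the momentum and weight-decay values, the crop-and-flip augmentation, and the `resnet18` stand-in model are assumptions, not values stated in the excerpt (the paper uses ResNet-32, VGG-19 and MobileNet-v2 on CIFAR).

```python
# Hedged reconstruction of the reported CIFAR-10/100 recipe; commented values marked
# "assumed" are common defaults, not taken from the paper.
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([                 # "standard data augmentation" (assumed: crop + flip)
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = torchvision.models.resnet18(num_classes=10)   # stand-in; paper uses ResNet-32 / VGG-19 / MobileNet-v2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)  # momentum/WD assumed
epochs = 158                                          # per the Table A.1 excerpt
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # cosine annealing, stepped once per epoch
```

For ImageNet-1K the excerpt instead specifies a batch size of 1024 and a learning rate of 1.024 with a 5-epoch linear warm-up, which would replace the data loader and the first few epochs of the schedule above.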