Advancing Dynamic Sparse Training by Exploring Optimization Opportunities

Authors: Jie Ji, Gen Li, Lu Yin, Minghai Qin, Geng Yuan, Linke Guo, Shiwei Liu, Xiaolong Ma

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we carry out experiments to comprehensively demonstrate the advantages of BiDST. We evaluate BiDST in comparison with the state-of-the-art (SOTA) DST methods, and show superior accuracy, effective mask searching ability, as well as great applicability for implementations. We follow the traditional network and dataset selection used in prior DST methods. We use ResNet-32 (Zagoruyko & Komodakis, 2016), VGG-19 (Simonyan & Zisserman, 2014) and MobileNet-v2 (Sandler et al., 2018) on CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), and we use ResNet-34 and ResNet-50 (He et al., 2016) on ImageNet-1K dataset (Deng et al., 2009).
Researcher Affiliation | Academia | 1 Clemson University, USA; 2 University of Aberdeen, Scotland; 3 University of Georgia, USA; 4 University of Oxford, England.
Pseudocode | Yes | Algorithm 1 (BiDST implementation details).
  Input: a DNN model with randomly initialized weights θ_0; a random mask ψ_0 with sparsity s; a flag parameter F indicating when to change the subnetwork topology based on ψ.
  Output: a sparse model satisfying the target sparsity s.
  Set t = 0. Set the number of non-zero weights to k = s·|θ|.
  while t < T do
    ψ̂_t ← Binarize(ψ_t, argmax(ψ_t, k))
    Train the subnetwork f(ψ̂_t ⊙ θ_t) by solving Eq. 10.
    Update mask ψ by solving Eq. 11.
    t = t + 1.
  (A hedged PyTorch sketch of this loop follows the table.)
Open Source Code | Yes | Code available at https://github.com/jjsrf/BiDST-ICML2024.
Open Datasets | Yes | We use ResNet-32 (Zagoruyko & Komodakis, 2016), VGG-19 (Simonyan & Zisserman, 2014) and MobileNet-v2 (Sandler et al., 2018) on CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), and we use ResNet-34 and ResNet-50 (He et al., 2016) on ImageNet-1K dataset (Deng et al., 2009).
Dataset Splits | No | The paper uses standard datasets (CIFAR-10/100, ImageNet-1K) with predefined splits and mentions "standard data augmentation" and a "standard training recipe". However, it does not state split percentages or sample counts for training, validation, and test sets, nor how any validation split was constructed, beyond implying standard usage.
Hardware Specification | Yes | We test the on-device training performance using a Samsung Galaxy S21 with Snapdragon 888 chipset. For DST, the on-device computation is done by static code (e.g., OpenCL for GPU and C++ for CPU), and the training acceleration is obtained from compiler optimization that skips the zeros in weights. ... implemented on Snapdragon 888. (See the zero-skipping sketch after the table.)
Software Dependencies | No | We extend the code generation of TVM (Chen et al., 2018) and design a training engine on the Snapdragon 888. For DST, the on-device computation is done by static code (e.g., OpenCL for GPU and C++ for CPU)... The paper names these tools but does not give specific version numbers.
Experiment Setup | Yes | We use a standard training recipe following Yuan et al. (2021) and Wang et al. (2020b). To ensure fair comparison, all BiDST experiments have a slight scale-down in the number of training epochs to compensate for the mask-learning computation cost. We use standard data augmentation, and a cosine annealing learning rate schedule is used with the SGD optimizer. For CIFAR-10/100, we use a batch size of 64 and set the initial learning rate to 0.1. For ImageNet-1K, we use a batch size of 1024 and a learning rate of 1.024 with a linear warm-up for 5 epochs. Due to limited space, we put detailed settings in Appendix A. Table A.1 (hyperparameter settings, excerpt): training epochs 158; batch size 64; initial learning rate 0.1; mask update frequency 8; regularization coefficient λ = 1e-4. (A hedged reconstruction of the CIFAR recipe follows the table.)
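
The Algorithm 1 row above alternates between training the binarized subnetwork and updating the soft mask. Below is a minimal, hypothetical PyTorch sketch of that loop, assuming a single linear layer for brevity; `MaskedLinear`, `binarize_topk`, the straight-through estimator, and the alternating optimizer steps are illustrative stand-ins for "solving Eq. 10 / Eq. 11" (the equations are not reproduced in this excerpt), and none of these names come from the BiDST repository.

```python
# Hypothetical sketch of the Algorithm 1 loop; names are illustrative, not from the BiDST code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def binarize_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Return a {0,1} mask that keeps the k largest-scoring positions."""
    mask = torch.zeros_like(scores)
    idx = torch.topk(scores.flatten(), k).indices
    mask.view(-1)[idx] = 1.0
    return mask


class MaskedLinear(nn.Module):
    """A linear layer whose weights are gated by a learnable soft mask (psi)."""

    def __init__(self, in_features: int, out_features: int, sparsity: float):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.scores = nn.Parameter(torch.rand(out_features, in_features))  # soft mask psi
        self.k = max(1, int((1 - sparsity) * self.weight.numel()))         # non-zeros kept

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hard = binarize_topk(self.scores, self.k)   # Binarize(psi_t, argmax(psi_t, k))
        # Straight-through estimator: the forward pass uses the hard mask,
        # the backward pass routes gradients to the soft scores.
        mask = hard + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)


layer = MaskedLinear(32, 10, sparsity=0.9)
opt_w = torch.optim.SGD([layer.weight], lr=0.1, momentum=0.9)   # weight step, stand-in for Eq. 10
opt_m = torch.optim.SGD([layer.scores], lr=0.01)                # mask step, stand-in for Eq. 11
for t in range(100):                                            # while t < T
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))     # toy data
    loss = F.cross_entropy(layer(x), y)
    opt_w.zero_grad(); opt_m.zero_grad()
    loss.backward()
    opt_w.step()
    if t % 8 == 0:   # mask update frequency 8, as in the Table A.1 excerpt
        opt_m.step()
```

The point of the sketch is that a single backward pass yields gradients for both the weights and the mask scores, so the topology can be refreshed at a fixed frequency without a separate prune-and-regrow pass; how BiDST actually couples the two updates is defined by Eqs. 10-11 in the paper, not by this code.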
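
The hardware row attributes the on-device speedup to static code that skips zero weights. The paper's engine emits OpenCL/C++ through TVM; the NumPy sketch below only illustrates the underlying idea of a compressed (CSR-style) weight layout in which just the non-zero entries are stored and traversed. The function names here are assumptions for illustration, not part of the paper's engine.

```python
# Illustrative only: a CSR-style mat-vec that touches only non-zero weights.
import numpy as np


def to_csr(dense: np.ndarray):
    """Compress a dense weight matrix so only non-zero entries are stored."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.asarray(values), np.asarray(col_idx, dtype=np.int64), np.asarray(row_ptr)


def csr_matvec(values, col_idx, row_ptr, x):
    """y = W @ x, visiting only the stored non-zero weights of each row."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
    return y


W = np.random.randn(8, 16) * (np.random.rand(8, 16) > 0.9)  # ~90% of weights are zero
x = np.random.randn(16)
vals, cols, ptr = to_csr(W)
assert np.allclose(csr_matvec(vals, cols, ptr, x), W @ x)
```

Because DST keeps the sparsity pattern fixed between mask updates, a compiler can specialize loops like `csr_matvec` ahead of time, which is the kind of zero-skipping optimization the excerpt credits for the Snapdragon 888 acceleration.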
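
Finally, the experiment-setup row pins down most of the CIFAR recipe (SGD, batch size 64, initial learning rate 0.1, cosine annealing, 158 epochs). The sketch below is a hedged reconstruction of that recipe in PyTorch/torchvision: the momentum and weight-decay values, the crop-and-flip augmentation, and the `resnet18` stand-in model are assumptions, not values stated in the excerpt (the paper uses ResNet-32, VGG-19 and MobileNet-v2 on CIFAR).

```python
# Hedged reconstruction of the reported CIFAR-10/100 recipe; commented values marked
# "assumed" are common defaults, not taken from the paper.
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([                 # "standard data augmentation" (assumed: crop + flip)
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = torchvision.models.resnet18(num_classes=10)   # stand-in; paper uses ResNet-32 / VGG-19 / MobileNet-v2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)  # momentum/WD assumed
epochs = 158                                          # per the Table A.1 excerpt
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # cosine annealing, stepped once per epoch
```

For ImageNet-1K the excerpt instead specifies a batch size of 1024 and a learning rate of 1.024 with a 5-epoch linear warm-up, which would replace the data loader and the first few epochs of the schedule above.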