Picking Winning Tickets Before Training by Preserving Gradient Flow

Authors: Chaoqi Wang, Guodong Zhang, Roger Grosse

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet, using VGGNet and ResNet architectures. Our method can prune 80% of the weights of a VGG-16 network on ImageNet at initialization, with only a 1.6% drop in top-1 accuracy. Moreover, our method achieves significantly better performance than the baseline at extreme sparsity levels.
Researcher Affiliation | Academia | Chaoqi Wang, Guodong Zhang, Roger Grosse, University of Toronto, Vector Institute {cqwang, gdzhang, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1 Gradient Signal Preservation (GraSP); Algorithm 2 Hessian-gradient Product. (A minimal scoring sketch follows this table.)
Open Source Code | Yes | Our code is made public at: https://github.com/alecwangcq/GraSP
Open Datasets | Yes | We empirically investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet, using VGGNet and ResNet architectures. ... (Krizhevsky, 2009) ... (Deng et al., 2009)
Dataset Splits | No | The paper describes training parameters (epochs, learning rate, batch size) and mentions the datasets used, but does not specify explicit train/validation/test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU specifications, or cloud instance types.
Software Dependencies | No | The paper mentions using "Pytorch (Paszke et al., 2017) official implementation" but does not specify a version number for PyTorch or any other software libraries or dependencies.
Experiment Setup | Yes | The pruned network is trained with Kaiming initialization (He et al., 2015) using SGD for 160 epochs for CIFAR-10/100, and 300 epochs for Tiny-ImageNet, with an initial learning rate of 0.1 and batch size 128. The learning rate is decayed by a factor of 0.1 at 1/2 and 3/4 of the total number of epochs. ... For ImageNet, we adopt the Pytorch (Paszke et al., 2017) official implementation, but we used more epochs for training according to Liu et al. (2019). Specifically, we train the pruned networks with SGD for 150 epochs, and decay the learning rate by a factor of 0.1 every 50 epochs. (A schedule sketch follows this table.)
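The Pseudocode row cites Algorithm 1 (GraSP) and Algorithm 2 (Hessian-gradient Product). As a rough illustration of how such a score can be computed with double backpropagation, here is a minimal PyTorch sketch; the function name grasp_scores, the cross-entropy loss, and single-batch scoring are assumptions for illustration and may omit details of the authors' released implementation.

import torch
import torch.nn as nn
import torch.autograd as autograd

def grasp_scores(model, inputs, targets):
    """Minimal sketch (assumed): per-weight GraSP-style scores -theta * (H g),
    where H g is a Hessian-gradient product obtained via double backprop."""
    weights = [p for p in model.parameters() if p.requires_grad]
    loss = nn.functional.cross_entropy(model(inputs), targets)

    # First backward pass: gradient g, keeping the graph for a second differentiation.
    grads = autograd.grad(loss, weights, create_graph=True)

    # Hessian-gradient product: d/dtheta of g^T stop_grad(g) equals H g.
    gTg = sum((g * g.detach()).sum() for g in grads)
    Hg = autograd.grad(gTg, weights)

    # Score each weight; per the paper, weights whose removal least reduces
    # gradient flow (the largest scores) are the ones pruned.
    return [-(w.detach() * h) for w, h in zip(weights, Hg)]

In use, the per-layer scores would be flattened into a single vector and the desired fraction of weights with the highest scores pruned globally, yielding the mask applied at initialization.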
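For the Experiment Setup row, the quoted CIFAR schedule (SGD, initial learning rate 0.1, batch size 128, decay by 0.1 at 1/2 and 3/4 of training) maps onto a standard PyTorch milestone scheduler. The sketch below is an assumed reconstruction; the momentum and weight-decay values are not given in the quoted text and are placeholders.

import torch

def build_training_setup(model, total_epochs=160, base_lr=0.1):
    # SGD with lr 0.1 as quoted; momentum and weight decay are assumed placeholders.
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate by 0.1 at 1/2 and 3/4 of the total number of epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[total_epochs // 2, (3 * total_epochs) // 4], gamma=0.1)
    return optimizer, scheduler

# CIFAR-10/100: total_epochs=160; Tiny-ImageNet: total_epochs=300 (per the quoted setup).
# For ImageNet (150 epochs, decay every 50 epochs), the analogous choice would be
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1).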