Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

Authors: Tianlong Chen, Zhenyu Zhang, Pengjun Wang, Santosh Balachandra, Haoyu Ma, Zehao Wang, Zhangyang Wang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate our proposals with multiple network architectures on diverse datasets, including CIFAR-10/100 and Tiny-ImageNet. For example, our methods reduce robust generalization gap and overfitting by 34.44% and 4.02%, with comparable robust/standard accuracy boosts and 87.83%/87.82% training/inference FLOPs savings on CIFAR-100 with ResNet-18. Besides, our approaches can be organically combined with existing regularizers, establishing new state-of-the-art results in AT.
Researcher Affiliation | Academia | ¹University of Texas at Austin, ²University of Science and Technology of China, ³University of California, Irvine; {tianlong.chen,santoshb,atlaswang}@utexas.edu, {zzy19969,wpj520,wangze}@mail.ustc.edu.cn, haoyum3@uci.edu
Pseudocode | Yes | A detailed algorithmic implementation is provided in Algorithm 1 of Appendix A1. Detailed procedures are summarized in Algorithm 2 of Appendix A1.
Open Source Code | Yes | Codes are available at https://github.com/VITA-Group/Sparsity-Win-Robust-Generalization.
Open Datasets | Yes | Our experiments consider two popular architectures, ResNet-18 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2014), on three representative datasets: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and Tiny-ImageNet (Deng et al., 2009).
Dataset Splits | Yes | We randomly split one-tenth of the training samples as the validation dataset, and the performance is reported on the official testing dataset. We select two checkpoints during training: best, which has the best RA values on the validation set, and final, i.e., the last checkpoint. (A split sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper mentions using 'PGD-based adversarial training' and an 'SGD optimizer', but does not specify software names with version numbers (e.g., PyTorch version, CUDA version).
Experiment Setup | Yes | We implement our experiments with the original PGD-based adversarial training (Madry et al., 2018b), in which we train the network against an ℓ∞ adversary with a maximum perturbation ϵ of 8/255. 10-step PGD for training and 20-step PGD for evaluation are chosen with a step size α of 2/255, following Madry et al. (2018b); Chen et al. (2021e). For each experiment, we train the network for 200 epochs with an SGD optimizer, whose momentum and weight decay are kept at 0.9 and 5×10⁻⁴, respectively. The learning rate starts from 0.1 and decays by a factor of 10 at epochs 100 and 150; the batch size is 128, following Rice et al. (2020). For Robust Bird, the threshold τ of the mask distance is set to 0.1. In Flying Birds(+), we calculate the layer-wise sparsity by Ideal Gas Quotas (IGQ) (Vysogorets & Kempe, 2021) and then apply random pruning to initialize the sparse masks. FB updates the sparse connectivity every 2000 iterations of AT, with an update ratio k that starts from 50% and decays by cosine annealing. More details are provided in Appendix A2. Hyperparameters are either tuned by grid search or follow Liu et al. (2021b). (A training-loop sketch appears after the table.)
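The Dataset Splits row describes a one-tenth validation hold-out from the official training set. Below is a minimal PyTorch sketch of such a split on CIFAR-10; the data root and the fixed seed are illustrative assumptions, not values reported in the paper.

```python
import torch
from torchvision import datasets, transforms

# Hold out one-tenth of the official CIFAR-10 training set for validation.
full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = len(full_train) // 10                        # 5,000 of 50,000 samples
train_set, val_set = torch.utils.data.random_split(
    full_train, [len(full_train) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))      # assumed seed, for a reproducible split
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())  # official test set for reporting
```

Per the paper's protocol, the best checkpoint is then selected by robust accuracy on `val_set`, while final is simply the last epoch's checkpoint.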
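The Experiment Setup row maps onto a standard PGD adversarial training loop. The sketch below plugs in the reported hyperparameters (ϵ = 8/255, α = 2/255, 10 PGD steps for training, SGD with momentum 0.9 and weight decay 5×10⁻⁴, learning rate 0.1 decayed 10× at epochs 100 and 150, batch size 128). It is an illustrative reconstruction, not the authors' released code: the off-the-shelf ResNet-18 and the reuse of `train_set` from the split sketch above are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD with a random start (Madry et al., 2018)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep pixels in [0, 1]
    return x_adv.detach()

# Standard torchvision ResNet-18 (not CIFAR-adapted) is an assumption for brevity.
model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for epoch in range(200):
    model.train()
    for x, y in train_loader:
        x_adv = pgd_attack(model, x, y, steps=10)    # 10-step PGD during training
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
    scheduler.step()                                 # lr: 0.1 -> 0.01 @ 100 -> 0.001 @ 150

# Evaluation uses 20-step PGD, e.g. pgd_attack(model, x, y, steps=20).
```

The Robust Bird / Flying Birds sparsity machinery (mask updates every 2000 AT iterations with an update ratio cosine-annealed from 50%, IGQ layer-wise sparsity) would sit on top of this loop and is not shown; see the authors' repository for the actual implementation.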