Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation
Authors: Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, Zhangyang Wang
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive transfer learning experiments on both Convolutional Neural Networks and Vision Transformers with classification, dense prediction, and language modeling tasks show that Back Razor could yield up to 97% sparsity, saving 9.2x memory usage, without losing accuracy. |
| Researcher Affiliation | Collaboration | Texas A&M University; University of Texas at Austin; Google. Emails: {jiangziyu,xueq13}@tamu.edu, {xxchen,atlaswang}@utexas.edu, {xianzhi,dennyzhou}@google.com |
| Pseudocode | Yes | Algorithm 1: Backward and Update with Sparse Activation (see the illustrative sketch after this table). |
| Open Source Code | Yes | The code is available at: https://github.com/VITA-Group/BackRazor_Neurips22 |
| Open Datasets | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49]. |
| Dataset Splits | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49]. |
| Hardware Specification | Yes | Our experiments are implemented with PyTorch [50] and conducted on 1080 Ti or V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not provide a specific version number for it or any other software dependencies. |
| Experiment Setup | Yes | The fine-tuning epochs and batch size are set to 50 and 8, respectively. The model is optimized with the Adam [53] optimizer and a cosine learning rate schedule [54]. The initial learning rate is tuned for each dataset. We employ the standard SGD optimizer with cosine learning rate decay for fine-tuning. The training steps are fixed at 20k and the initial learning rate is tuned for each dataset. We employ a larger batch size of 128, following the common practice for accelerating training [40, 56, 32]. (A configuration sketch follows the table.) |
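
The "Pseudocode" row above names Algorithm 1 (Backward and Update with Sparse Activation). The following is a minimal, hypothetical PyTorch sketch of the underlying idea reported by the paper, namely keeping only the largest-magnitude activations for the backward pass; `SparseActivationLinear`, the `keep_ratio` argument, the 2-D input assumption, and the dense storage of the masked tensor are illustrative choices, not the authors' released implementation (which would store the sparse activations in a compressed format to realize the memory saving).

```python
import torch
import torch.nn as nn


class SparseActivationLinear(torch.autograd.Function):
    """Linear op that saves a top-k sparsified copy of its 2-D input for backward."""

    @staticmethod
    def forward(ctx, x, weight, bias, keep_ratio=0.03):
        # Dense forward pass (unchanged by the sparsification).
        out = x.matmul(weight.t())
        if bias is not None:
            out = out + bias
        # Keep only the k largest-magnitude activation entries for backprop.
        k = max(1, int(keep_ratio * x.numel()))
        threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
        mask = x.abs() >= threshold
        # NOTE: a real memory saving requires storing (values, indices) in a
        # compressed format; a dense masked tensor is kept here for clarity.
        ctx.save_for_backward(x * mask, weight)
        ctx.has_bias = bias is not None
        return out

    @staticmethod
    def backward(ctx, grad_out):
        sparse_x, weight = ctx.saved_tensors
        grad_x = grad_out.matmul(weight)          # gradient w.r.t. the input
        grad_w = grad_out.t().matmul(sparse_x)    # uses the sparsified activation
        grad_b = grad_out.sum(dim=0) if ctx.has_bias else None
        return grad_x, grad_w, grad_b, None       # no gradient for keep_ratio


# Usage with an ordinary nn.Linear's parameters.
layer = nn.Linear(128, 64)
x = torch.randn(8, 128, requires_grad=True)
out = SparseActivationLinear.apply(x, layer.weight, layer.bias, 0.03)
out.sum().backward()
```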
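
For the "Experiment Setup" row, this is a minimal sketch of the quoted ViT fine-tuning recipe (50 epochs, batch size 8, Adam with a cosine learning-rate schedule). The model, the synthetic batch, and the initial learning rate value are placeholders, since the paper tunes the learning rate per dataset and fine-tunes a pre-trained backbone on each downstream dataset.

```python
import torch

# Placeholder model standing in for the pre-trained backbone being fine-tuned.
model = torch.nn.Linear(768, 100)

epochs, batch_size = 50, 8   # values quoted in the row above
init_lr = 1e-4               # placeholder: the paper tunes this per dataset

optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # One illustrative step; a real loop iterates over the downstream dataset.
    inputs = torch.randn(batch_size, 768)
    targets = torch.randint(0, 100, (batch_size,))
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```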