Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation

Authors: Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, Zhangyang Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive transfer learning experiments on both Convolutional Neural Networks and Vision Transformers with classification, dense prediction, and language modeling tasks show that Back Razor could yield up to 97% sparsity, saving 9.2x memory usage, without losing accuracy.
Researcher Affiliation | Collaboration | Texas A&M University; University of Texas at Austin; Google. {jiangziyu,xueq13}@tamu.edu, {xxchen,atlaswang}@utexas.edu, {xianzhi,dennyzhou}@google.com
Pseudocode | Yes | Algorithm 1: Backward and Update with Sparse Activation (see the sketch after this table).
Open Source Code | Yes | The code is available at: https://github.com/VITA-Group/BackRazor_Neurips22.
Open Datasets | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49].
Dataset Splits | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49].
Hardware Specification | Yes | Our experiments are implemented with Pytorch [50] and conducted on 1080 Ti or V100 GPUs.
Software Dependencies | No | The paper mentions 'Pytorch [50]' but does not provide a specific version number for it or any other software dependencies.
Experiment Setup | Yes | The fine-tuning epochs and batch size are set to 50 and 8, respectively. The model is optimized with the Adam [53] optimizer and a cosine learning rate schedule [54]. The initial learning rate is tuned for each dataset. We employ the standard SGD optimizer with cosine learning rate decay for fine-tuning. The training steps are fixed at 20k and the initial learning rate is tuned for each dataset. We employ a larger batch size of 128, following the common practice for accelerating training [40, 56, 32].
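
The pseudocode row above refers to Algorithm 1 (Backward and Update with Sparse Activation), the paper's core mechanism: activations cached for backpropagation are pruned to their largest-magnitude entries (up to 97% sparsity), and the backward pass runs on the sparsified copy. The PyTorch sketch below illustrates the idea for a single linear layer only and is not the authors' implementation; the names `topk_sparsify` and `SparseBackLinear`, the per-sample top-k rule, and the dense mask (a real implementation would use compressed sparse storage to actually save memory) are illustrative assumptions. The repository linked above is the authoritative code.

```python
import torch


def topk_sparsify(x, keep_ratio=0.03):
    # Keep only the largest-magnitude entries of each sample and zero the rest.
    # keep_ratio=0.03 mirrors the ~97% activation sparsity quoted above.
    flat = x.abs().flatten(1)
    k = max(1, int(keep_ratio * flat.shape[1]))
    threshold = flat.topk(k, dim=1).values[:, -1:]   # k-th largest magnitude per sample
    mask = (flat >= threshold).view_as(x)
    return x * mask  # a memory-saving implementation would store this sparsely


class SparseBackLinear(torch.autograd.Function):
    """Linear layer whose cached activation is sparsified for the backward pass.

    The forward output is computed from the dense input; only the activation
    saved for gradient computation is pruned, so the forward pass is unchanged.
    """

    @staticmethod
    def forward(ctx, inp, weight, bias, keep_ratio):
        out = inp @ weight.t() + bias
        ctx.save_for_backward(topk_sparsify(inp, keep_ratio), weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        sparse_inp, weight = ctx.saved_tensors
        grad_inp = grad_out @ weight          # input gradient needs only the weight
        grad_w = grad_out.t() @ sparse_inp    # weight gradient uses the sparsified activation
        grad_b = grad_out.sum(dim=0)
        return grad_inp, grad_w, grad_b, None  # no gradient for keep_ratio


# Usage with hypothetical shapes:
x = torch.randn(8, 512, requires_grad=True)
w = torch.randn(256, 512, requires_grad=True)
b = torch.zeros(256, requires_grad=True)
y = SparseBackLinear.apply(x, w, b, 0.03)
y.sum().backward()
```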
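
For the experiment-setup row, the two quoted recipes (Adam with a cosine schedule, 50 epochs, batch size 8; SGD with cosine decay over a fixed 20k steps, batch size 128) map onto standard PyTorch optimizer and scheduler calls. The sketch below is only an assumed illustration of those settings: the model and steps-per-epoch count are stand-ins, and the initial learning rates are placeholders, since the paper tunes the learning rate per dataset.

```python
import torch

model = torch.nn.Linear(2048, 100)                   # stand-in for the fine-tuned network
epochs, steps_per_epoch, batch_size = 50, 1000, 8    # 50 epochs, batch size 8 (first recipe)

# First quoted recipe: Adam optimizer with a cosine learning-rate schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder LR; tuned per dataset
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)

# Second quoted recipe: SGD with cosine decay over a fixed 20k steps, batch size 128.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # placeholder LR; tuned per dataset
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20_000)
```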