Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation
Authors: Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, Zhangyang Wang
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive transfer learning experiments on both Convolutional Neural Networks and Vision Transformers with classification, dense prediction, and language modeling tasks show that Back Razor could yield up to 97% sparsity, saving 9.2x memory usage, without losing accuracy. |
| Researcher Affiliation | Collaboration | Texas A&M University; University of Texas at Austin; Google. Emails: {jiangziyu,xueq13}@tamu.edu, {xxchen,atlaswang}@utexas.edu, {xianzhi,dennyzhou}@google.com |
| Pseudocode | Yes | Algorithm 1: Backward and Update with Sparse Activation (see the illustrative sketch after this table). |
| Open Source Code | Yes | The code is available at: https://github.com/VITA-Group/BackRazor_Neurips22 |
| Open Datasets | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49]. |
| Dataset Splits | Yes | Specifically, we employ ImageNet-1K and ImageNet-22K for CNNs and ViTs, respectively. For downstream fine-tuning, we consider eight datasets: Pets [43], Aircraft [44], CIFAR10, CIFAR100 [45], Flowers [46], Cars [47], CUB [48], and Food [49]. |
| Hardware Specification | Yes | Our experiments are implemented with PyTorch [50] and conducted on 1080 Ti or V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not provide a specific version number for it or any other software dependencies. |
| Experiment Setup | Yes | The fine-tuning epochs and batch size are set to 50 and 8, respectively. The model is optimized with the Adam [53] optimizer and a cosine learning rate schedule [54]. The initial learning rate is tuned for each dataset. We employ the standard SGD optimizer with cosine learning rate decay for fine-tuning. The training steps are fixed at 20k and the initial learning rate is tuned for each dataset. We employ a larger batch size of 128, following the common practice for accelerating training [40, 56, 32]. (A configuration sketch follows the table.) |
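
The "Pseudocode" row above names Algorithm 1 (Backward and Update with Sparse Activation). The following is a minimal, hypothetical PyTorch sketch of the underlying idea reported by the paper, namely keeping only the largest-magnitude activations for the backward pass; `SparseActivationLinear`, the `keep_ratio` argument, the 2-D input assumption, and the dense storage of the masked tensor are illustrative choices, not the authors' released implementation (which would store the sparse activations in a compressed format to realize the memory saving).

```python
import torch
import torch.nn as nn


class SparseActivationLinear(torch.autograd.Function):
    """Linear op that saves a top-k sparsified copy of its 2-D input for backward."""

    @staticmethod
    def forward(ctx, x, weight, bias, keep_ratio=0.03):
        # Dense forward pass (unchanged by the sparsification).
        out = x.matmul(weight.t())
        if bias is not None:
            out = out + bias
        # Keep only the k largest-magnitude activation entries for backprop.
        k = max(1, int(keep_ratio * x.numel()))
        threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
        mask = x.abs() >= threshold
        # NOTE: a real memory saving requires storing (values, indices) in a
        # compressed format; a dense masked tensor is kept here for clarity.
        ctx.save_for_backward(x * mask, weight)
        ctx.has_bias = bias is not None
        return out

    @staticmethod
    def backward(ctx, grad_out):
        sparse_x, weight = ctx.saved_tensors
        grad_x = grad_out.matmul(weight)          # gradient w.r.t. the input
        grad_w = grad_out.t().matmul(sparse_x)    # uses the sparsified activation
        grad_b = grad_out.sum(dim=0) if ctx.has_bias else None
        return grad_x, grad_w, grad_b, None       # no gradient for keep_ratio


# Usage with an ordinary nn.Linear's parameters.
layer = nn.Linear(128, 64)
x = torch.randn(8, 128, requires_grad=True)
out = SparseActivationLinear.apply(x, layer.weight, layer.bias, 0.03)
out.sum().backward()
```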
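
For the "Experiment Setup" row, this is a minimal sketch of the quoted ViT fine-tuning recipe (50 epochs, batch size 8, Adam with a cosine learning-rate schedule). The model, the synthetic batch, and the initial learning rate value are placeholders, since the paper tunes the learning rate per dataset and fine-tunes a pre-trained backbone on each downstream dataset.

```python
import torch

# Placeholder model standing in for the pre-trained backbone being fine-tuned.
model = torch.nn.Linear(768, 100)

epochs, batch_size = 50, 8   # values quoted in the row above
init_lr = 1e-4               # placeholder: the paper tunes this per dataset

optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # One illustrative step; a real loop iterates over the downstream dataset.
    inputs = torch.randn(batch_size, 768)
    targets = torch.randint(0, 100, (batch_size,))
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```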