Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation
Authors: Yuchen Yang, Yingdong Shi, Cheems Wang, Xiantong Zhen, Yuxuan Shi, Jun Xu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to 30% of the peak memory usage. Our code is released at github. ... 6. Experiments In this section, we conduct experiments by deploying our ReGELU2, ReSiLU2, MS-LN, and MS-RMSNorm into the representative ViT (Dosovitskiy et al., 2021) for vision tasks, as well as LLaMA (Touvron et al., 2023) and RoBERTa (Liu et al., 2019) for natural language understanding tasks. |
| Researcher Affiliation | Collaboration | 1School of Statistics and Data Science, Nankai University, Tianjin, China 2School of Information Science and Technology, ShanghaiTech University, Shanghai, China 3Department of Automation, Tsinghua University, Beijing, China 4Central Research Institute, United Imaging Healthcare, Co., Ltd. Correspondence to: Jun Xu <nankaimathxujun@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Memory-Sharing Layer Normalization (a hedged PyTorch sketch of the output-based, memory-sharing normalization idea appears below the table) |
| Open Source Code | Yes | Our code is released at github. |
| Open Datasets | Yes | We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to 30% of the peak memory usage. Our code is released at github. ... fine-tuning pretrained ViT-B (Dosovitskiy et al., 2021) with CIFAR10/100 (Krizhevsky et al., 2009) and FGVC (Jia et al., 2022). ... ImageNet-22k (Deng et al., 2009; Dosovitskiy et al., 2021)... LLaMA-7B and LLaMA-13B (Touvron et al., 2023) using Alpaca (Taori et al., 2023)... GLUE (Wang et al., 2018)... PASCAL VOC (Everingham et al., 2015)... SQuAD-v2 (Rajpurkar et al., 2018). |
| Dataset Splits | No | The paper describes train and test set preprocessing for the ViT experiments ('Normalize for the train set and Resize (to 224×224 px), CenterCrop, Normalize for the test set'), but it does not explicitly detail the percentages or method for splitting the primary training data into separate training, validation, and test sets. For LLaMA, it mentions evaluating on '5-shot MMLU', which is a benchmark, not a description of a validation split held out from the training data. (A hedged torchvision sketch of the quoted test-set preprocessing appears below the table.) |
| Hardware Specification | Yes | ViT-base experiments are conducted with 1 2080Ti GPU and ViT-large experiments are conducted with 1 L40 GPU. ... The training uses model parallel provided in the Transformers package (Wolf et al., 2020) with 2 H800 GPUs. ... with 2 RTX4090 GPUs. ... data parallel training using 4 RTX2080Ti. ... data parallel training by 4 RTX3060. ... using 4 RTX3060. |
| Software Dependencies | No | The paper mentions software such as 'PyTorch', the 'Transformers package', and 'AdamW', but does not specify version numbers, which are critical for reproducing the software environment. |
| Experiment Setup | Yes | For experiments on fine-tuning ViT-base and ViT-large with LoRA and LoRA-FA... The batch size is set as 64. ... The base learning rate is 1.25e-3 in LoRA and 1.25e-5 in Full Tuning. ... For experiments on fine-tuning LLaMA-7B and LLaMA-13B with QLoRA, the batch size is set as 4 and the number of gradient accumulation steps is set as 4. The total training iterations are 10000 steps. For LLaMA-7B, we use paged AdamW with no weight decay, tune constant learning rate in {1e-4, 2e-4}... For experiments on fine-tuning RoBERTa-base with LoRA, the batch size is set as 32. We use AdamW with the weight decay 0.01. All RoBERTa-base models are fine-tuned from the pretrained model independently for 30 epochs. We use a linear learning rate scheduler with warm-up ratio 0.1. The base learning rate for each task is chosen as the best one among {0.00005, 0.0001, 0.0005, 0.001, 0.005} in fine-tuning the baseline. (A hedged sketch assembling the RoBERTa-base hyperparameters appears below the table.) |
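
The Pseudocode row above quotes only the caption of Algorithm 1 (Memory-Sharing Layer Normalization). As a minimal illustration of the general idea, computing the LayerNorm backward pass from the layer's output rather than its input so that the stored activation can be shared with the following layer, here is a hedged PyTorch sketch. The class name `OutputBasedLayerNormFn` and every implementation detail are assumptions made for illustration; this is not the authors' released Algorithm 1.

```python
import torch


class OutputBasedLayerNormFn(torch.autograd.Function):
    """LayerNorm whose backward is computed from the *output* instead of the
    input, so the saved activation can be shared with a next layer that keeps
    this output anyway. Hedged sketch, not the paper's Algorithm 1."""

    @staticmethod
    def forward(ctx, x, weight, bias, eps):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        inv_std = torch.rsqrt(var + eps)
        x_hat = (x - mu) * inv_std
        y = x_hat * weight + bias
        # Save the output (shareable) plus small per-token statistics;
        # the full-size input tensor x is NOT kept for backward.
        ctx.save_for_backward(y, weight, bias, inv_std)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        y, weight, bias, inv_std = ctx.saved_tensors
        # Recover the normalized input from the output (assumes nonzero weight).
        x_hat = (y - bias) / weight
        g = grad_y * weight                         # dL/d(x_hat)
        mean_g = g.mean(dim=-1, keepdim=True)
        mean_gx = (g * x_hat).mean(dim=-1, keepdim=True)
        grad_x = inv_std * (g - mean_g - x_hat * mean_gx)
        reduce_dims = tuple(range(grad_y.dim() - 1))
        grad_weight = (grad_y * x_hat).sum(dim=reduce_dims)
        grad_bias = grad_y.sum(dim=reduce_dims)
        return grad_x, grad_weight, grad_bias, None
```

A call then looks like `y = OutputBasedLayerNormFn.apply(x, weight, bias, 1e-5)`; the intended saving, as the "memory-sharing" name suggests, is that `y` would typically be stored by the subsequent layer for its own backward pass, so the normalization adds no extra full-size activation.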
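
The test-set preprocessing quoted in the Dataset Splits row (Resize to 224×224 px, CenterCrop, Normalize) maps onto standard torchvision transforms. The sketch below is an assumption-based reconstruction: the `ToTensor` step and the ImageNet normalization statistics are not stated in the excerpt and are filled in only as common defaults.

```python
from torchvision import transforms

# Assumed ImageNet statistics; the excerpt does not state the Normalize values.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Test-set pipeline as quoted: Resize (to 224x224 px), CenterCrop, Normalize.
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),   # assumed: converts PIL images before Normalize
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```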
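
The RoBERTa-base/LoRA hyperparameters quoted in the Experiment Setup row (batch size 32, AdamW with weight decay 0.01, 30 epochs, linear schedule with warm-up ratio 0.1, base learning rate swept over {5e-5, 1e-4, 5e-4, 1e-3, 5e-3}) can be assembled with Hugging Face's `get_linear_schedule_with_warmup`. The helper below is hypothetical, not the authors' training script, and the single `base_lr` value is just one member of the reported sweep.

```python
import torch
from transformers import get_linear_schedule_with_warmup


def build_roberta_lora_optim(model: torch.nn.Module, steps_per_epoch: int):
    """Optimizer/scheduler matching the quoted GLUE setup: AdamW, weight decay
    0.01, 30 epochs, linear decay with warm-up ratio 0.1. Hypothetical helper."""
    epochs = 30
    base_lr = 5e-4                   # one value from {5e-5, 1e-4, 5e-4, 1e-3, 5e-3}
    total_steps = steps_per_epoch * epochs
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=base_lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```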