Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation

Authors: Yuchen Yang, Yingdong Shi, Cheems Wang, Xiantong Zhen, Yuxuan Shi, Jun Xu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to 30% of the peak memory usage. Our code is released at github. ... 6. Experiments In this section, we conduct experiments by deploying our ReGELU2, ReSiLU2, MS-LN, and MS-RMSNorm into the representative ViT (Dosovitskiy et al., 2021) for vision tasks, as well as LLaMA (Touvron et al., 2023) and RoBERTa (Liu et al., 2019) for natural language understanding tasks.
Researcher Affiliation | Collaboration | 1 School of Statistics and Data Science, Nankai University, Tianjin, China; 2 School of Information Science and Technology, ShanghaiTech University, Shanghai, China; 3 Department of Automation, Tsinghua University, Beijing, China; 4 Central Research Institute, United Imaging Healthcare, Co., Ltd. Correspondence to: Jun Xu <nankaimathxujun@gmail.com>.
Pseudocode | Yes | Algorithm 1 Memory-Sharing Layer Normalization (a hedged PyTorch sketch of a memory-sharing layer normalization follows the table).
Open Source Code | Yes | Our code is released at github.
Open Datasets | Yes | We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to 30% of the peak memory usage. Our code is released at github. ... fine-tuning pretrained ViT-B (Dosovitskiy et al., 2021) with CIFAR10/100 (Krizhevsky et al., 2009) and FGVC (Jia et al., 2022). ... ImageNet-22k (Deng et al., 2009; Dosovitskiy et al., 2021)... LLaMA-7B and LLaMA-13B (Touvron et al., 2023) using Alpaca (Taori et al., 2023)... GLUE (Wang et al., 2018)... PASCAL VOC (Everingham et al., 2015)... SQuAD-v2 (Rajpurkar et al., 2018).
Dataset Splits | No | The paper describes train and test set preprocessing for ViT experiments ('Normalize for the train set and Resize (to 224×224 px), Center Crop, Normalize for the test set'). However, it does not explicitly detail the percentages or method for splitting the primary training data into separate training, validation, and test sets. For LLaMA, it mentions evaluating on '5-shot MMLU', which is a benchmark, not a description of a validation split from the training data.
Hardware Specification | Yes | ViT-base experiments are conducted with 1 × 2080Ti GPU and ViT-large experiments are conducted with 1 × L40 GPU. ... The training uses model parallel provided in the Transformers package (Wolf et al., 2020) with 2 × H800 GPUs. ... with 2 × RTX4090 GPUs. ... data parallel training using 4 × RTX2080Ti. ... data parallel training by 4 × RTX3060. ... using 4 × RTX3060.
Software Dependencies | No | The paper mentions software such as 'PyTorch', the 'Transformers package', and 'AdamW' but does not specify their version numbers, which are critical for reproducing the software environment.
Experiment Setup | Yes | For experiments on fine-tuning ViT-base and ViT-large with LoRA and LoRA-FA... The batch size is set as 64. ... The base learning rate is 1.25e-3 in LoRA and 1.25e-5 in Full Tuning. ... For experiments on fine-tuning LLaMA-7B and LLaMA-13B with QLoRA, the batch size is set as 4 and the number of gradient accumulation steps is set as 4. The total training iterations are 10000 steps. For LLaMA-7B, we use paged AdamW with no weight decay, and tune a constant learning rate in {1e-4, 2e-4}... For experiments on fine-tuning RoBERTa-base with LoRA, the batch size is set as 32. We use AdamW with weight decay 0.01. All RoBERTa-base models are fine-tuned from the pretrained model independently for 30 epochs. We use a linear learning rate scheduler with warm-up ratio 0.1. The base learning rate for each task is chosen as the best one among {0.00005, 0.0001, 0.0005, 0.001, 0.005} in fine-tuning the baseline. (A hedged configuration sketch of the RoBERTa setup also follows the table.)
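The Pseudocode row above refers to Algorithm 1 (Memory-Sharing Layer Normalization), which is not reproduced on this page. As a rough illustration of the memory-sharing idea only, the sketch below shows a PyTorch layer normalization whose backward pass reconstructs the normalized input from the layer's output (which the following layer typically keeps anyway), so the full-size input activation does not need to be saved. The class name MemorySharingLayerNorm and every implementation detail are assumptions for illustration, not the authors' released code or their Algorithm 1.

```python
import torch


class MemorySharingLayerNorm(torch.autograd.Function):
    """Illustrative memory-sharing LayerNorm (not the paper's implementation).

    Instead of saving the full-size input x for the backward pass, it saves the
    output y plus the small per-row inverse standard deviation, and recovers the
    normalized input as (y - beta) / gamma.
    """

    @staticmethod
    def forward(ctx, x, gamma, beta, eps):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        rstd = torch.rsqrt(var + eps)
        x_hat = (x - mu) * rstd
        y = gamma * x_hat + beta
        # Save y and rstd; the input x itself is not stored.
        ctx.save_for_backward(y, gamma, beta, rstd)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        y, gamma, beta, rstd = ctx.saved_tensors
        # Reconstruct the normalized input from the saved output
        # (assumes gamma has no zero entries).
        x_hat = (y - beta) / gamma
        g = grad_y * gamma
        mean_g = g.mean(dim=-1, keepdim=True)
        mean_gx = (g * x_hat).mean(dim=-1, keepdim=True)
        grad_x = (g - mean_g - x_hat * mean_gx) * rstd
        reduce_dims = tuple(range(grad_y.dim() - 1))
        grad_gamma = (grad_y * x_hat).sum(dim=reduce_dims)
        grad_beta = grad_y.sum(dim=reduce_dims)
        return grad_x, grad_gamma, grad_beta, None


# Usage sketch:
#   x = torch.randn(8, 128, 768, requires_grad=True)
#   gamma = torch.ones(768, requires_grad=True)
#   beta = torch.zeros(768, requires_grad=True)
#   y = MemorySharingLayerNorm.apply(x, gamma, beta, 1e-5)
```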
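For the Experiment Setup row, the RoBERTa-base + LoRA hyperparameters quoted above can be approximated with Hugging Face TrainingArguments as sketched below. This is an assumed mapping, not the authors' training script; output_dir is a placeholder and the learning rate shown is just one value from the quoted sweep.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-base-lora-glue",  # placeholder path
    per_device_train_batch_size=32,       # "the batch size is set as 32"
    num_train_epochs=30,                  # fine-tuned independently for 30 epochs
    learning_rate=5e-4,                   # chosen per task from {5e-5, 1e-4, 5e-4, 1e-3, 5e-3}
    weight_decay=0.01,                    # AdamW with weight decay 0.01
    lr_scheduler_type="linear",           # linear learning rate scheduler
    warmup_ratio=0.1,                     # warm-up ratio 0.1
)
```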