LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Authors: Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
Researcher Affiliation | Collaboration | Taeho Kim (1), Yanming Wang (2), Vatshank Chaturvedi (2), Lokesh Gupta (2), Seyeon Kim (1), Yongin Kwon (3), Sangtae Ha (1). Affiliations: (1) University of Colorado Boulder, (2) Amazon Web Services, (3) Electronics and Telecommunications Research Institute. Emails: {taeho.kim,seyeon.kim,sangtae.ha}@colorado.edu, {yanmwang,vatshc,lokeshgu}@amazon.com, yongin.kwon@etri.re.kr
Pseudocode | Yes | Algorithm 1: Distributed Fine-Tuning Method Decision (an illustrative decision sketch follows the table).
Open Source Code | Yes | Our source code repository can be found at https://github.com/taehokim20/LLMem.
Open Datasets | Yes | The dataset used is the alpaca data [Taori et al., 2023], which consists of 52K instruction-following examples (a loading sketch follows the table).
Dataset Splits | No | The paper uses the alpaca data for fine-tuning but does not specify explicit training, validation, or test splits; the experiments focus on GPU memory usage during fine-tuning rather than dataset partitioning for model evaluation.
Hardware Specification | Yes | For a multi-GPU environment, we use a Tesla V100 (total GPU memory capacity: 16384 MB) with 4 GPUs in CloudLab [CloudLab, 2024]. We also use Colossal-AI [Li et al., 2023], a widely used framework for applying distributed fine-tuning methods, and PyTorch 2.0.1 with CUDA 11.7. The environment we used in the experiment was PyTorch 2.0.1 with CUDA 11.7 on an NVIDIA RTX 2060, which differs from PyTorch 1.2.0 with CUDA 9.0 on an NVIDIA Tesla P40 used in the DNNMem paper. (A device-query sketch follows the table.)
Software Dependencies | Yes | We also use Colossal-AI [Li et al., 2023], a widely used framework for applying distributed fine-tuning methods, and PyTorch 2.0.1 with CUDA 11.7.
Experiment Setup | Yes | LLMem takes a pre-trained model M, the total number of GPUs to fine-tune gpu_n, and the maximum sequence length sl. ... We measure peak GPU memory usage using only the maximum sequence length of 512. (A measurement sketch follows the table.)
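The Pseudocode row refers to the paper's Algorithm 1, which decides among distributed fine-tuning methods based on estimated GPU memory. The following is only a minimal illustrative sketch of such a memory-based decision, not the authors' algorithm: the method names, the estimate values, and the tie-breaking rule (lowest estimated peak memory) are assumptions made here for illustration.

```python
# Hypothetical sketch of a memory-based decision between distributed
# fine-tuning methods. The per-method peak-memory estimates are assumed
# to be given; this is NOT LLMem's actual Algorithm 1.
def choose_method(estimates_mb: dict, gpu_capacity_mb: float) -> str:
    """Pick a method whose estimated per-GPU peak memory fits the device."""
    feasible = {m: mem for m, mem in estimates_mb.items() if mem <= gpu_capacity_mb}
    if not feasible:
        raise RuntimeError("No distributed fine-tuning method fits in GPU memory")
    # Illustrative tie-break: lowest estimated peak memory wins.
    return min(feasible, key=feasible.get)

# Made-up numbers for a 16384 MB Tesla V100.
print(choose_method({"data_parallel": 21000.0, "tensor_parallel": 15200.0}, 16384.0))
```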
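For the Open Datasets row, the 52K Alpaca instruction-following data is publicly available. A minimal loading sketch, assuming the commonly used tatsu-lab/alpaca mirror on the Hugging Face Hub (the paper cites Taori et al., 2023 and does not name a specific loader):

```python
# Sketch: load the 52K Alpaca instruction-following data.
# "tatsu-lab/alpaca" is an assumed, commonly used Hugging Face mirror.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(len(alpaca))        # roughly 52K examples
print(alpaca[0].keys())   # instruction / input / output / text fields
```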
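The GPU capacities and library versions quoted in the Hardware Specification and Software Dependencies rows can be checked programmatically. A small sketch using standard PyTorch CUDA APIs (it reports whatever devices are visible; no paper-specific code is implied):

```python
# Sketch: report the name and total memory of each visible CUDA device
# (e.g. 16384 MB per Tesla V100) and the installed PyTorch/CUDA versions.
import torch

for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    print(f"GPU {idx}: {props.name}, {props.total_memory / (1024 ** 2):.0f} MB")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
```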
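For the Experiment Setup row, peak-usage numbers like those LLMem is validated against can be reproduced with PyTorch's built-in memory statistics. A minimal sketch of one fine-tuning step at sequence length 512; the model name ("gpt2"), batch size, and optimizer are placeholders rather than the paper's configuration, and this is only one way to take such a measurement:

```python
# Sketch: measure peak GPU memory for a single fine-tuning step at
# sequence length 512. Model, batch size, and optimizer are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"            # placeholder; the paper fine-tunes larger LLMs
seq_len, batch_size = 512, 1

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters())

input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len)).cuda()

torch.cuda.reset_peak_memory_stats()
loss = model(input_ids=input_ids, labels=input_ids).loss  # forward pass
loss.backward()                                            # backward pass
optimizer.step()                                           # optimizer update
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"Peak GPU memory for one step: {peak_mb:.1f} MB")
```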