GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Authors: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with the C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks.
Researcher Affiliation | Collaboration | California Institute of Technology; Meta AI; University of Texas at Austin; Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: GaLore, PyTorch-like, beginning "for weight in model.parameters():" (a hedged sketch of the full update appears after this table).
Open Source Code | No | Code is provided in the link.
Open Datasets | Yes | To evaluate its performance, we apply GaLore to train LLaMA-based large language models on the C4 dataset. The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, which is mainly intended to pre-train language models and word representations (Raffel et al., 2020). We use GLUE tasks to benchmark GaLore against LoRA for memory-efficient fine-tuning. GLUE is a benchmark for evaluating the performance of NLP models on a variety of tasks, including sentiment analysis, question answering, and textual entailment (Wang et al., 2019).
Dataset Splits | Yes | Table 2: Comparison with low-rank algorithms on pre-training various sizes of LLaMA models on the C4 dataset. Validation perplexity is reported, along with a memory estimate of the total of parameters and optimizer states based on BF16 format.
Hardware Specification | Yes | All experiments run on NVIDIA A100 GPUs. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
Software Dependencies | No | The paper mentions optimizers such as 'AdamW', '8-bit Adam', and 'Adafactor', and refers to a 'PyTorch-like' implementation, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The details of our task setups and hyperparameters are provided in the appendix. Table 5 shows the hyperparameters of LLaMA models across model sizes. We use a maximum sequence length of 256 for all models, with a batch size of 131K tokens. For all experiments, we adopt learning-rate warmup for the first 10% of the training steps, and use cosine annealing for the learning-rate schedule, decaying to 10% of the initial learning rate. For all models, GaLore uses the same hyperparameters, including a learning rate of 0.01, a scale factor α of 0.25, and a subspace change frequency T of 200 (the schedule sketch below illustrates the warmup and cosine decay).
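The Pseudocode row above quotes only the first line of Algorithm 1 (GaLore, PyTorch-like). Below is a minimal, self-contained sketch of the gradient low-rank projection update for a single 2-D weight matrix. The helper names (compute_projector, galore_step) and the plain-SGD inner update are illustrative assumptions, not the paper's released code, which applies Adam (or Adafactor) to the projected gradient.

```python
# Sketch of a GaLore-style update for one 2-D weight matrix.
# Assumptions (not from the paper's code): helper names, plain-SGD inner update.
import torch

def compute_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Top-`rank` left singular vectors of the gradient span the low-rank subspace."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                               # shape (m, rank)

def galore_step(weight: torch.nn.Parameter, proj: torch.Tensor | None, step: int,
                rank: int = 4, lr: float = 0.01, scale: float = 0.25,
                update_freq: int = 200) -> torch.Tensor:
    grad = weight.grad                               # full-rank gradient, shape (m, n)
    if proj is None or step % update_freq == 0:      # refresh subspace every T steps
        proj = compute_projector(grad, rank)
    lor_grad = proj.T @ grad                         # original space -> compact space, (rank, n)
    lor_update = -lr * lor_grad                      # inner update (SGD here; Adam in the paper)
    weight.data += scale * (proj @ lor_update)       # compact space -> original space
    return proj

# Toy usage on a random weight matrix.
w = torch.nn.Parameter(torch.randn(64, 64))
proj = None
for step in range(3):
    loss = (w ** 2).sum()
    loss.backward()
    proj = galore_step(w, proj, step)
    w.grad = None
```

The point of the sketch is that only the projected (rank x n) gradient feeds the inner optimizer, which is where the reported savings in optimizer-state memory come from.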
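For the Experiment Setup row, the warmup-plus-cosine schedule can be written compactly. The sketch below assumes a linear warmup over the first 10% of steps followed by cosine decay to 10% of the peak rate; the function name lr_at_step and the linear warmup shape are assumptions, not taken from the paper's configuration files.

```python
# Sketch of the described schedule: 10% warmup, cosine annealing to 10% of peak LR.
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 0.01,
               warmup_frac: float = 0.1, final_frac: float = 0.1) -> float:
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps             # linear warmup (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))        # decays from 1 to 0
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```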