GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Authors: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. |
| Researcher Affiliation | Collaboration | California Institute of Technology, Meta AI, University of Texas at Austin, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1 gives GaLore in PyTorch-like pseudocode, starting with `for weight in model.parameters():` (a hedged sketch of this update loop follows the table). |
| Open Source Code | No | Code is provided in the link. |
| Open Datasets | Yes | To evaluate its performance, we apply GaLore to train LLaMA-based large language models on the C4 dataset. The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, which is mainly intended to pre-train language models and word representations (Raffel et al., 2020). We use GLUE tasks to benchmark GaLore against LoRA for memory-efficient fine-tuning. GLUE is a benchmark for evaluating the performance of NLP models on a variety of tasks, including sentiment analysis, question answering, and textual entailment (Wang et al., 2019). (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | Table 2: Comparison with low-rank algorithms on pre-training various sizes of LLaMA models on C4 dataset. Validation perplexity is reported, along with a memory estimate of the total of parameters and optimizer states based on BF16 format. |
| Hardware Specification | Yes | All experiments run on NVIDIA A100 GPUs. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW', '8-bit Adam', and 'Adafactor', and refers to a 'PyTorch-like' implementation, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The details of our task setups and hyperparameters are provided in the appendix. Table 5 shows the most important hyperparameters of LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 131K tokens. For all experiments, we adopt learning rate warmup for the first 10% of the training steps, and use cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. For all models, GaLore uses the same hyperparameters, including a learning rate of 0.01, a scale factor α of 0.25, and a subspace change frequency T of 200. (A sketch of this schedule follows the table.) |
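The Pseudocode row quotes only the opening line of the paper's Algorithm 1, so here is a minimal PyTorch-style sketch of the projected update it describes: project the gradient into a low-rank subspace via an SVD-derived matrix refreshed every T steps, run an Adam-style update there, then project back and scale by α. The names (`GaLoreState`, `galore_adam_step`) and details such as which side of the gradient is projected are our own assumptions, not the authors' released implementation.

```python
import torch

class GaLoreState:
    """Per-parameter state: low-rank projector plus Adam moments in the subspace."""
    def __init__(self):
        self.step = 0
        self.P = None   # (m, r) orthonormal projector from an SVD of the gradient
        self.M = None   # first moment, shape (r, n)
        self.V = None   # second moment, shape (r, n)

def galore_adam_step(weight, state, rank=128, update_proj_gap=200,
                     lr=0.01, scale=0.25, betas=(0.9, 0.999), eps=1e-8):
    """One GaLore-style Adam step for a single 2D weight matrix (sketch)."""
    grad = weight.grad                                    # (m, n)
    if state.step % update_proj_gap == 0:
        # Refresh the subspace every `update_proj_gap` (T) steps via an SVD of
        # the current gradient; keep the top-r left singular vectors.
        U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
        state.P = U[:, :rank].to(grad.dtype)
    if state.M is None:
        # Moments live in the fixed (r, n) low-rank shape; this sketch keeps
        # them across projector refreshes.
        state.M = grad.new_zeros(state.P.shape[1], grad.shape[1])
        state.V = grad.new_zeros(state.P.shape[1], grad.shape[1])
    state.step += 1

    R = state.P.T @ grad                                  # project: (r, n)
    b1, b2 = betas
    state.M.mul_(b1).add_(R, alpha=1 - b1)                # Adam moments in the subspace
    state.V.mul_(b2).add_(R * R, alpha=1 - b2)
    m_hat = state.M / (1 - b1 ** state.step)
    v_hat = state.V / (1 - b2 ** state.step)
    N = m_hat / (v_hat.sqrt() + eps)

    update = scale * (state.P @ N)                        # project back to (m, n)
    weight.data.add_(update, alpha=-lr)
```

Only the r × n projected gradient and its moments are kept in optimizer state, which is where the quoted memory saving over full-rank Adam comes from.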
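Both datasets named in the Open Datasets row are publicly available. As an illustration only (the paper does not prescribe a loader, so the library choice and dataset identifiers are our assumption), they could be pulled with the Hugging Face `datasets` library, streaming C4 because of its size:

```python
from datasets import load_dataset

# C4 (the "en" configuration of allenai/c4) is very large, so stream it
# rather than materializing it on disk.
c4_train = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4_train))["text"][:200])

# A GLUE task (MRPC here) of the kind used for the RoBERTa fine-tuning
# comparison against LoRA.
glue_mrpc = load_dataset("glue", "mrpc")
print(glue_mrpc["train"][0])
```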
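Finally, a small sketch of the learning-rate schedule described in the Experiment Setup row: linear warmup over the first 10% of steps, then cosine annealing down to 10% of the initial learning rate. The GaLore-specific values (lr = 0.01, α = 0.25, T = 200, 256-token sequences, 131K-token batches) are copied from the quoted text; the function names, the exact warmup shape, and the reading of 131K as 131,072 tokens are our assumptions.

```python
import math

GALORE_HPARAMS = {
    "lr": 0.01,                   # GaLore learning rate (same across model sizes)
    "scale_alpha": 0.25,          # scale factor alpha applied to the projected update
    "update_proj_gap": 200,       # subspace change frequency T
    "max_seq_len": 256,
    "tokens_per_batch": 131_072,  # quoted as "131K tokens"; 512 x 256 is our reading
}

def lr_at_step(step, total_steps, base_lr=0.01, warmup_frac=0.10, final_frac=0.10):
    """Warmup for the first `warmup_frac` of training, then cosine-decay
    to `final_frac * base_lr` by the final step."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (final_frac + (1.0 - final_frac) * cosine)
```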