GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Authors: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. |
| Researcher Affiliation | Collaboration | California Institute of Technology, Meta AI, University of Texas at Austin, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1 gives GaLore in PyTorch-like pseudocode, starting with `for weight in model.parameters():` (a hedged sketch of this update loop follows the table). |
| Open Source Code | No | Code is provided in the link. |
| Open Datasets | Yes | To evaluate its performance, we apply GaLore to train LLaMA-based large language models on the C4 dataset. The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, which is mainly intended to pre-train language models and word representations (Raffel et al., 2020). We use GLUE tasks to benchmark GaLore against LoRA for memory-efficient fine-tuning. GLUE is a benchmark for evaluating the performance of NLP models on a variety of tasks, including sentiment analysis, question answering, and textual entailment (Wang et al., 2019). (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | Table 2: Comparison with low-rank algorithms on pre-training various sizes of LLaMA models on C4 dataset. Validation perplexity is reported, along with a memory estimate of the total of parameters and optimizer states based on BF16 format. |
| Hardware Specification | Yes | All experiments run on NVIDIA A100 GPUs. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW', '8-bit Adam', and 'Adafactor', and refers to a 'PyTorch-like' implementation, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The details of our task setups and hyperparameters are provided in the appendix. Table 5 shows the most important hyperparameters of LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 131K tokens. For all experiments, we adopt learning rate warmup for the first 10% of the training steps, and use cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. For all models, GaLore uses the same hyperparameters, including a learning rate of 0.01, a scale factor α of 0.25, and a subspace change frequency T of 200. (A sketch of this schedule follows the table.) |
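The Pseudocode row quotes only the opening line of the paper's Algorithm 1, so here is a minimal PyTorch-style sketch of the projected update it describes: project the gradient into a low-rank subspace via an SVD-derived matrix refreshed every T steps, run an Adam-style update there, then project back and scale by α. The names (`GaLoreState`, `galore_adam_step`) and details such as which side of the gradient is projected are our own assumptions, not the authors' released implementation.

```python
import torch

class GaLoreState:
    """Per-parameter state: low-rank projector plus Adam moments in the subspace."""
    def __init__(self):
        self.step = 0
        self.P = None   # (m, r) orthonormal projector from an SVD of the gradient
        self.M = None   # first moment, shape (r, n)
        self.V = None   # second moment, shape (r, n)

def galore_adam_step(weight, state, rank=128, update_proj_gap=200,
                     lr=0.01, scale=0.25, betas=(0.9, 0.999), eps=1e-8):
    """One GaLore-style Adam step for a single 2D weight matrix (sketch)."""
    grad = weight.grad                                    # (m, n)
    if state.step % update_proj_gap == 0:
        # Refresh the subspace every `update_proj_gap` (T) steps via an SVD of
        # the current gradient; keep the top-r left singular vectors.
        U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
        state.P = U[:, :rank].to(grad.dtype)
    if state.M is None:
        # Moments live in the fixed (r, n) low-rank shape; this sketch keeps
        # them across projector refreshes.
        state.M = grad.new_zeros(state.P.shape[1], grad.shape[1])
        state.V = grad.new_zeros(state.P.shape[1], grad.shape[1])
    state.step += 1

    R = state.P.T @ grad                                  # project: (r, n)
    b1, b2 = betas
    state.M.mul_(b1).add_(R, alpha=1 - b1)                # Adam moments in the subspace
    state.V.mul_(b2).add_(R * R, alpha=1 - b2)
    m_hat = state.M / (1 - b1 ** state.step)
    v_hat = state.V / (1 - b2 ** state.step)
    N = m_hat / (v_hat.sqrt() + eps)

    update = scale * (state.P @ N)                        # project back to (m, n)
    weight.data.add_(update, alpha=-lr)
```

Only the r × n projected gradient and its moments are kept in optimizer state, which is where the quoted memory saving over full-rank Adam comes from.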
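Both datasets named in the Open Datasets row are publicly available. As an illustration only (the paper does not prescribe a loader, so the library choice and dataset identifiers are our assumption), they could be pulled with the Hugging Face `datasets` library, streaming C4 because of its size:

```python
from datasets import load_dataset

# C4 (the "en" configuration of allenai/c4) is very large, so stream it
# rather than materializing it on disk.
c4_train = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4_train))["text"][:200])

# A GLUE task (MRPC here) of the kind used for the RoBERTa fine-tuning
# comparison against LoRA.
glue_mrpc = load_dataset("glue", "mrpc")
print(glue_mrpc["train"][0])
```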
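Finally, a small sketch of the learning-rate schedule described in the Experiment Setup row: linear warmup over the first 10% of steps, then cosine annealing down to 10% of the initial learning rate. The GaLore-specific values (lr = 0.01, α = 0.25, T = 200, 256-token sequences, 131K-token batches) are copied from the quoted text; the function names, the exact warmup shape, and the reading of 131K as 131,072 tokens are our assumptions.

```python
import math

GALORE_HPARAMS = {
    "lr": 0.01,                   # GaLore learning rate (same across model sizes)
    "scale_alpha": 0.25,          # scale factor alpha applied to the projected update
    "update_proj_gap": 200,       # subspace change frequency T
    "max_seq_len": 256,
    "tokens_per_batch": 131_072,  # quoted as "131K tokens"; 512 x 256 is our reading
}

def lr_at_step(step, total_steps, base_lr=0.01, warmup_frac=0.10, final_frac=0.10):
    """Warmup for the first `warmup_frac` of training, then cosine-decay
    to `final_frac * base_lr` by the final step."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (final_frac + (1.0 - final_frac) * cosine)
```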