Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Authors: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. |
| Researcher Affiliation | Collaboration | 1California Institute of Technology 2Meta AI 3University of Texas at Austin 4Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: GaLore, PyTorch-like. `for weight in model.parameters():` |
| Open Source Code | No | Code is provided in the link. |
| Open Datasets | Yes | To evaluate its performance, we apply GaLore to train LLaMA-based large language models on the C4 dataset. C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, which is mainly intended to pre-train language models and word representations (Raffel et al., 2020). We use GLUE tasks to benchmark GaLore against LoRA for memory-efficient fine-tuning. GLUE is a benchmark for evaluating the performance of NLP models on a variety of tasks, including sentiment analysis, question answering, and textual entailment (Wang et al., 2019). |
| Dataset Splits | Yes | Table 2: Comparison with low-rank algorithms on pre-training various sizes of LLaMA models on C4 dataset. Validation perplexity is reported, along with a memory estimate of the total of parameters and optimizer states based on BF16 format. |
| Hardware Specification | Yes | All experiments run on NVIDIA A100 GPUs. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW', '8-bit Adam', and 'Adafactor', and refers to a 'PyTorch-like' implementation, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The details of our task setups and hyperparameters are provided in the appendix. Table 5 shows the most hyperparameters of LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 131K tokens. For all experiments, we adopt learning rate warmup for the first 10% of the training steps, and use cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. For all models, GaLore use the same hyperparameters, including the learning rate of 0.01, scale factor α of 0.25, and the subspace change frequency of T of 200. |
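
The learning rate schedule quoted in the Experiment Setup row (warmup over the first 10% of steps, then cosine annealing down to 10% of the initial rate, with a GaLore learning rate of 0.01) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function name `lr_at_step`, its parameters, and the choice of a linear warmup shape are assumptions, since the quoted text does not specify them.

```python
import math


def lr_at_step(step, total_steps, peak_lr=0.01, warmup_frac=0.10, final_frac=0.10):
    """Hypothetical helper: learning rate at a given 0-indexed training step.

    Assumes linear warmup (the quoted text only says "warmup") followed by
    cosine annealing from peak_lr down to final_frac * peak_lr.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)

    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_frac
    return min_lr + (peak_lr - min_lr) * cosine
```

With these defaults, the rate peaks at 0.01 at the end of warmup and ends at 0.001. The step count passed as `total_steps` is a placeholder; the quoted text does not state the number of training steps.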