Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Gradient Weight-normalized Low-rank Projection for Efficient LLM Training
Authors: Jia-Hong Huang, Yixian Shen, Hongyi Zhu, Stevan Rudinac, Evangelos Kanoulas
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. |
| Researcher Affiliation | Academia | Jia-Hong Huang*, Yixian Shen*, Hongyi Zhu, Stevan Rudinac, Evangelos Kanoulas, University of Amsterdam |
| Pseudocode | Yes | Algorithm 1: Our proposed GradNormLoRP |
| Open Source Code | Yes | Code https://github.com/Jhhuangkay/Gradient-Weightnormalized-Low-rank-Projection-for-Efficient-LLMTraining |
| Open Datasets | Yes | leveraging the C4 dataset (Raffel et al. 2020b) and the GLUE benchmark (Wang et al. 2019). |
| Dataset Splits | Yes | For fine-tuning, we use the GLUE benchmark (Wang et al. 2019)... For pre-training, we use the C4 dataset (Raffel et al. 2020b)... |
| Hardware Specification | Yes | we demonstrate the feasibility of pre-training the LLaMA 7B model on consumer-level GPUs with 24GB memory, such as the NVIDIA RTX 4090... The model is trained for 80K steps with 10.5B tokens, using 8-node parallel training on 32 A100 GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For fine-tuning, we evaluated our model on the GLUE benchmark, exploring learning rates in the range of {1e-4, 2e-4, 3e-4, 4e-4, 5e-4}, batch sizes of 16 and 32, and a fixed number of 30 epochs. Specifically, we used a batch size of 16 for all tasks except for CoLA, which used a batch size of 32. The maximum sequence length for all tasks was set to 512 for BERT-base, RoBERTa-base, RoBERTa-large, and BART-base models. For pre-training, we applied GradNormLoRP across various model sizes ranging from 60M to 1B parameters. The hyperparameters for GradNormLoRP were consistent across all models, with a learning rate of 0.01 and a scale factor (αs) of 0.25. The learning rate was fine-tuned from the set {1e-2, 1e-3, 5e-4, 1e-4}, selecting the best rate based on validation perplexity. Each model was pre-trained for 10,000 steps. For models scaled up to 7B parameters, we set the batch size to 16 and varied the training steps accordingly. |
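To make the memory-saving claim above concrete, the sketch below illustrates the general idea behind low-rank gradient projection (the family of methods GradNormLoRP belongs to): the full gradient is projected onto a small rank-r subspace, so optimizer state only needs to be kept in the compact (r × n) space instead of the full (m × n) space. This is a minimal, hedged illustration, not the authors' implementation; the function names, the plain SGD-style update, and the rank choice are assumptions for demonstration only.

```python
import numpy as np

def lowrank_project(grad, rank):
    """Project a full (m, n) gradient onto a rank-r subspace via truncated SVD."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]               # (m, r) projection basis
    return P, P.T @ grad          # compact (r, n) gradient for optimizer state

def lowrank_update(W, grad, rank, lr=0.01):
    """One illustrative step: optimize in the low-rank space, project back to full rank."""
    P, g_low = lowrank_project(grad, rank)
    # Optimizer moments (omitted here) would live at (r, n) size, hence the memory savings.
    return W - lr * (P @ g_low)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
G = rng.standard_normal((64, 64))
W_new = lowrank_update(W, G, rank=8)
print(W_new.shape)  # full-rank weight is unchanged in shape: (64, 64)
```

The update direction `P @ g_low` has rank at most r, so with r = 8 the optimizer state for this 64×64 layer shrinks by a factor of 8, which is the mechanism behind the up-to-89.5% optimizer-memory reduction reported in the paper (GradNormLoRP additionally applies weight normalization and 8-bit quantization, not shown here).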