Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Gradient Weight-normalized Low-rank Projection for Efficient LLM Training
Authors: Jia-Hong Huang, Yixian Shen, Hongyi Zhu, Stevan Rudinac, Evangelos Kanoulas
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. |
| Researcher Affiliation | Academia | Jia-Hong Huang*, Yixian Shen*, Hongyi Zhu, Stevan Rudinac, Evangelos Kanoulas, University of Amsterdam |
| Pseudocode | Yes | Algorithm 1: Our proposed GradNormLoRP |
| Open Source Code | Yes | Code https://github.com/Jhhuangkay/Gradient-Weightnormalized-Low-rank-Projection-for-Efficient-LLMTraining |
| Open Datasets | Yes | leveraging the C4 dataset (Raffel et al. 2020b) and the GLUE benchmark (Wang et al. 2019). |
| Dataset Splits | Yes | For fine-tuning, we use the GLUE benchmark (Wang et al. 2019)... For pre-training, we use the C4 dataset (Raffel et al. 2020b)... |
| Hardware Specification | Yes | we demonstrate the feasibility of pre-training the LLaMA 7B model on consumer-level GPUs with 24GB memory, such as the NVIDIA RTX 4090... The model is trained for 80K steps with 10.5B tokens, using 8-node parallel training on 32 A100 GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For fine-tuning, we evaluated our model on the GLUE benchmark, exploring learning rates in the range of {1e-4, 2e-4, 3e-4, 4e-4, 5e-4}, batch sizes of 16 and 32, and a fixed number of 30 epochs. Specifically, we used a batch size of 16 for all tasks except for CoLA, which used a batch size of 32. The maximum sequence length for all tasks was set to 512 for BERT-base, RoBERTa-base, RoBERTa-large, and BART-base models. For pre-training, we applied GradNormLoRP across various model sizes ranging from 60M to 1B parameters. The hyperparameters for GradNormLoRP were consistent across all models, with a learning rate of 0.01 and a scale factor (αs) of 0.25. The learning rate was fine-tuned from the set {1e-2, 1e-3, 5e-4, 1e-4}, selecting the best rate based on validation perplexity. Each model was pre-trained for 10,000 steps. For models scaled up to 7B parameters, we set the batch size to 16 and varied the training steps accordingly. |
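To make the memory-saving claim above concrete, the sketch below illustrates the general idea behind low-rank gradient projection (the family of methods GradNormLoRP belongs to): the full gradient is projected onto a small rank-r subspace, so optimizer state only needs to be kept in the compact (r × n) space instead of the full (m × n) space. This is a minimal, hedged illustration, not the authors' implementation; the function names, the plain SGD-style update, and the rank choice are assumptions for demonstration only.

```python
import numpy as np

def lowrank_project(grad, rank):
    """Project a full (m, n) gradient onto a rank-r subspace via truncated SVD."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]               # (m, r) projection basis
    return P, P.T @ grad          # compact (r, n) gradient for optimizer state

def lowrank_update(W, grad, rank, lr=0.01):
    """One illustrative step: optimize in the low-rank space, project back to full rank."""
    P, g_low = lowrank_project(grad, rank)
    # Optimizer moments (omitted here) would live at (r, n) size, hence the memory savings.
    return W - lr * (P @ g_low)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
G = rng.standard_normal((64, 64))
W_new = lowrank_update(W, G, rank=8)
print(W_new.shape)  # full-rank weight is unchanged in shape: (64, 64)
```

The update direction `P @ g_low` has rank at most r, so with r = 8 the optimizer state for this 64×64 layer shrinks by a factor of 8, which is the mechanism behind the up-to-89.5% optimizer-memory reduction reported in the paper (GradNormLoRP additionally applies weight normalization and 8-bit quantization, not shown here).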