Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Low-rank Momentum Factorization for Memory Efficient Training
Authors: Pouria Mahdavinia, Mehrdad Mahdavi
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MoFaSGD's effectiveness and efficiency across three large language modeling setups: pretraining, natural language understanding (NLU) fine-tuning, and instruction-tuning. |
| Researcher Affiliation | Academia | Pouria Mahdavinia EMAIL Department of Computer Science and Engineering The Pennsylvania State University Mehrdad Mahdavi EMAIL Department of Computer Science and Engineering The Pennsylvania State University |
| Pseudocode | Yes | Algorithm 1 MoFaSGD: Momentum Factorized Stochastic Gradient Descent |
| Open Source Code | Yes | Our implementation is available at https://github.com/pmahdavi/MoFaSGD. |
| Open Datasets | Yes | FineWeb dataset (Penedo et al., 2025)... GLUE benchmark (Wang, 2018)... tulu-3-sft-mixture dataset (Lambert et al., 2024). |
| Dataset Splits | Yes | measures performance using validation perplexity on a held-out partition of FineWeb... We used 5% of the sampled dataset for validation. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | Experiments were implemented using standard libraries for deep learning, including PyTorch, Hugging Face Transformers, and Accelerate. Specific library versions are detailed in the code repository. |
| Experiment Setup | Yes | Key hyperparameters selected for the NanoGPT pre-training experiments are summarized in Table 5. ... Learning rates for AdamW, GaLore, and MoFaSGD were tuned via grid search over {1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3, 3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2}. MoFaSGD's momentum decay β was tuned over {0.5, 0.85, 0.90, 0.95}. GaLore's SVD frequency was tuned over {10, 25, 75, 150, 300}. |
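
The search spaces quoted in the Experiment Setup row can be written out as a small grid, which makes the size of each sweep easier to see. The following is a minimal sketch, not the authors' tuning script. It assumes the learning-rate exponents are negative (1e-4 through 5e-2) and that each optimizer-specific hyperparameter was searched jointly with the learning rate, which the quoted text does not state. The names `build_grid` and `EXTRA_AXES` and the config keys are hypothetical.

```python
from itertools import product

# Search spaces as quoted in the Experiment Setup row
# (assumption: the learning-rate exponents are negative).
LEARNING_RATES = [1e-4, 2e-4, 3e-4, 5e-4, 8e-4, 1e-3,
                  3e-3, 5e-3, 8e-3, 1e-2, 2e-2, 5e-2]
MOFASGD_BETAS = [0.5, 0.85, 0.90, 0.95]
GALORE_SVD_FREQS = [10, 25, 75, 150, 300]

# Optimizer-specific axis searched in addition to the learning rate.
# AdamW has no extra axis in the quoted text.
EXTRA_AXES = {
    "adamw": {},
    "galore": {"svd_freq": GALORE_SVD_FREQS},
    "mofasgd": {"beta": MOFASGD_BETAS},
}


def build_grid(optimizer):
    """Return every candidate config for one optimizer (hypothetical helper).

    Assumption: a full Cartesian product over the learning rate and the
    optimizer-specific axis. The source does not say whether the axes
    were tuned jointly or one at a time.
    """
    axes = {"lr": LEARNING_RATES, **EXTRA_AXES[optimizer]}
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]


if __name__ == "__main__":
    for name in EXTRA_AXES:
        print(name, len(build_grid(name)))
```

Under the joint-search assumption this gives 12 candidate configurations for AdamW, 60 for GaLore, and 48 for MoFaSGD; a one-axis-at-a-time sweep would be smaller.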