Memory-Efficient LLM Training with Online Subspace Descent

Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream tasks performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines."
Researcher Affiliation | Academia | "Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu. The University of Texas at Austin. {kaizhaol,bliu,lzchen,lqiang}@utexas.edu"
Pseudocode | Yes | "Algorithm 1 Online Subspace Descent" (a minimal sketch of the update follows this table)
Open Source Code | Yes | "Code is available at https://github.com/kyleliang919/Online-Subspace-Descent."
Open Datasets | Yes | "pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset [20]" Reference [20]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
Dataset Splits | No | The paper reports validation perplexity (e.g., the Figure 1 caption: "Validation perplexity of LLaMA 1B...") but does not give percentages or sample counts for the training, validation, and test splits of the C4 dataset used for pretraining.
Hardware Specification | Yes | "All experiments except for large 7B experiments are conducted on a single NVIDIA A100 GPU." "We measure and analyze the execution time of SVD and online PCA on a popular data center GPU (A100) and a consumer GPU (RTX 3090)."
Software Dependencies | No | The paper mentions a 'Pytorch implementation' but does not provide version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | "Learning rate: For the small model (60M), learning rate choices are more flexible, producing similar results. However, for larger models (350M, 1B), we recommend using a learning rate that is 10 times smaller, specifically 0.001." "Batch size is set to 512 and gradient clipping is set to 1.0. Warmup is set to 10% of the total training steps." "We set α = 5 for all experiments and set λ = 0.1 for all subsequent experiments." "Table 1: Pretraining LLaMA 1B with a sequence length of 256 and for 10K steps..." (the quoted hyperparameters are gathered in the configuration sketch below)
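
The Pseudocode row only names Algorithm 1, so the quoted evidence gives no detail of the update itself. The PyTorch sketch below is our reading of Online Subspace Descent, not the authors' reference code: the projection P is refreshed by a single online-PCA gradient step rather than a periodic SVD, and Adam-style moments are kept in the projected subspace. The objective used for the P update, the function names (online_pca_update, subspace_adam_step), and the reading of α as the projection step size and λ as a regularization weight are assumptions on our part; the authors' implementation is in the released repository linked above.

```python
import torch

def online_pca_update(P, G, alpha=5.0, lam=0.1):
    # One gradient step on an assumed online-PCA objective for the projection P:
    #   ||G - P P^T G||_F^2 + lam * ||P^T P - I||_F^2
    # alpha and lam follow the values quoted in Experiment Setup; the exact
    # objective is our reading of the method, not a verbatim transcription.
    P = P.detach().clone().requires_grad_(True)   # P: (m, r) projection
    G = G.detach()                                # G: (m, n) gradient of the weight
    recon = P @ (P.T @ G)                         # rank-r reconstruction of the gradient
    eye = torch.eye(P.shape[1], device=P.device, dtype=P.dtype)
    loss = (G - recon).pow(2).sum() + lam * (P.T @ P - eye).pow(2).sum()
    loss.backward()
    return (P - alpha * P.grad).detach()

def subspace_adam_step(W, G, P, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # Adam-style moments m, v live in the r-dimensional subspace (shape r x n);
    # the update is projected back to full rank before being applied to W.
    R = P.T @ G                                   # project the gradient into the subspace
    m.mul_(betas[0]).add_(R, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)               # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)               # bias-corrected second moment
    with torch.no_grad():
        W.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)
    return W
```

In a training loop, one would call online_pca_update on each layer's gradient to refresh P, then subspace_adam_step to apply the parameter update; the precise schedule and update rule should be taken from Algorithm 1 in the paper or from the released code.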
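
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The field names below are illustrative, not the authors' actual configuration keys, and only explicitly quoted values are filled in.

```python
# Hypothetical configuration mirroring the quoted Experiment Setup; field names are ours.
config = {
    "model": "llama-1b",       # LLaMA models from 60M to 7B are reported; Table 1 uses 1B
    "dataset": "c4",
    "sequence_length": 256,    # Table 1: sequence length of 256
    "max_steps": 10_000,       # Table 1: 10K steps
    "batch_size": 512,
    "learning_rate": 1e-3,     # recommended for the larger 350M and 1B models
    "grad_clip": 1.0,
    "warmup_ratio": 0.10,      # warmup is 10% of total training steps
    "alpha": 5.0,              # projection-update step size (our reading of α)
    "lambda": 0.1,             # regularization weight (our reading of λ)
}
```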