Memory-Efficient LLM Training with Online Subspace Descent

Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream tasks performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines."
Researcher Affiliation | Academia | "Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu. The University of Texas at Austin. {kaizhaol,bliu,lzchen,lqiang}@utexas.edu"
Pseudocode | Yes | "Algorithm 1 Online Subspace Descent" (a minimal sketch of the update follows this table)
Open Source Code | Yes | "Code is available at https://github.com/kyleliang919/Online-Subspace-Descent."
Open Datasets | Yes | "pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset [20]" Reference [20]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
Dataset Splits | No | The paper reports validation perplexity (e.g., the Figure 1 caption: "Validation perplexity of LLaMA 1B...") but does not give percentages or sample counts for the training, validation, and test splits of the C4 dataset used for pretraining.
Hardware Specification | Yes | "All experiments except for large 7B experiments are conducted on a single NVIDIA A100 GPU." "We measure and analyze the execution time of SVD and online PCA on a popular data center GPU (A100) and a consumer GPU (RTX 3090)."
Software Dependencies | No | The paper mentions a 'Pytorch implementation' but does not provide version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | "Learning rate: For the small model (60M), learning rate choices are more flexible, producing similar results. However, for larger models (350M, 1B), we recommend using a learning rate that is 10 times smaller, specifically 0.001." "Batch size is set to 512 and gradient clipping is set to 1.0. Warmup is set to 10% of the total training steps." "We set α = 5 for all experiments and set λ = 0.1 for all subsequent experiments." "Table 1: Pretraining LLaMA 1B with a sequence length of 256 and for 10K steps..." (the quoted hyperparameters are gathered in the configuration sketch below)
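
The Pseudocode row only names Algorithm 1, so the quoted evidence gives no detail of the update itself. The PyTorch sketch below is our reading of Online Subspace Descent, not the authors' reference code: the projection P is refreshed by a single online-PCA gradient step rather than a periodic SVD, and Adam-style moments are kept in the projected subspace. The objective used for the P update, the function names (online_pca_update, subspace_adam_step), and the reading of α as the projection step size and λ as a regularization weight are assumptions on our part; the authors' implementation is in the released repository linked above.

```python
import torch

def online_pca_update(P, G, alpha=5.0, lam=0.1):
    # One gradient step on an assumed online-PCA objective for the projection P:
    #   ||G - P P^T G||_F^2 + lam * ||P^T P - I||_F^2
    # alpha and lam follow the values quoted in Experiment Setup; the exact
    # objective is our reading of the method, not a verbatim transcription.
    P = P.detach().clone().requires_grad_(True)   # P: (m, r) projection
    G = G.detach()                                # G: (m, n) gradient of the weight
    recon = P @ (P.T @ G)                         # rank-r reconstruction of the gradient
    eye = torch.eye(P.shape[1], device=P.device, dtype=P.dtype)
    loss = (G - recon).pow(2).sum() + lam * (P.T @ P - eye).pow(2).sum()
    loss.backward()
    return (P - alpha * P.grad).detach()

def subspace_adam_step(W, G, P, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # Adam-style moments m, v live in the r-dimensional subspace (shape r x n);
    # the update is projected back to full rank before being applied to W.
    R = P.T @ G                                   # project the gradient into the subspace
    m.mul_(betas[0]).add_(R, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)               # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)               # bias-corrected second moment
    with torch.no_grad():
        W.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)
    return W
```

In a training loop, one would call online_pca_update on each layer's gradient to refresh P, then subspace_adam_step to apply the parameter update; the precise schedule and update rule should be taken from Algorithm 1 in the paper or from the released code.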
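
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The field names below are illustrative, not the authors' actual configuration keys, and only explicitly quoted values are filled in.

```python
# Hypothetical configuration mirroring the quoted Experiment Setup; field names are ours.
config = {
    "model": "llama-1b",       # LLaMA models from 60M to 7B are reported; Table 1 uses 1B
    "dataset": "c4",
    "sequence_length": 256,    # Table 1: sequence length of 256
    "max_steps": 10_000,       # Table 1: 10K steps
    "batch_size": 512,
    "learning_rate": 1e-3,     # recommended for the larger 350M and 1B models
    "grad_clip": 1.0,
    "warmup_ratio": 0.10,      # warmup is 10% of total training steps
    "alpha": 5.0,              # projection-update step size (our reading of α)
    "lambda": 0.1,             # regularization weight (our reading of λ)
}
```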