Memory-Efficient LLM Training with Online Subspace Descent
Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream tasks performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines. |
| Researcher Affiliation | Academia | Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu; The University of Texas at Austin; {kaizhaol,bliu,lzchen,lqiang}@utexas.edu |
| Pseudocode | Yes | Algorithm 1 Online Subspace Descent (a hedged PyTorch sketch of the update follows the table). |
| Open Source Code | Yes | Code is available at https://github.com/kyleliang919/Online-Subspace-Descent. |
| Open Datasets | Yes | pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset [20]; [20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. |
| Dataset Splits | No | The paper mentions using a validation set for perplexity (e.g., Figure 1 caption: "Validation perplexity of LLaMA 1B...") but does not provide specific percentages or sample counts for the training, validation, and test splits of the C4 dataset used for pretraining. |
| Hardware Specification | Yes | All experiments except for large 7B experiments are conducted on a single NVIDIA A100 GPU. We measure and analyze the execution time of SVD and online PCA on a popular data center GPU (A100) and a consumer GPU (RTX 3090). (A rough timing sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'Pytorch implementation' but does not provide specific version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | Learning rate: For the small model (60M), learning rate choices are more flexible, producing similar results. However, for larger models (350M, 1B), we recommend using a learning rate that is 10 times smaller, specifically 0.001. Batch size is set to 512 and gradient clipping is set to 1.0. Warmup is set to 10% of the total training steps. We set α = 5 for all experiments and set λ = 0.1 for all subsequent experiments. Table 1: Pretraining LLaMA 1B with a sequence length of 256 and for 10K steps... (The reported hyperparameters are collected into a config sketch after the table.) |
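
The paper's Algorithm 1 replaces the periodic SVD used by earlier low-rank methods with an online-PCA update of the projection matrix. The sketch below is a minimal PyTorch illustration of that idea for a single weight matrix; the function names, the plain-gradient-descent PCA step, and the Adam-style bookkeeping are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of one Online Subspace Descent step for a single m x n weight
# matrix W with an m x r projection P. Function names and the exact PCA update
# are assumptions; only the overall structure (low-rank optimizer states plus a
# continuously updated projection) follows the paper's description.
import torch

def update_projection(P, G, eta_p=0.01, lam=0.1):
    """One online-PCA gradient step on P: minimize ||G - P P^T G||_F^2
    plus an orthogonality penalty lam * ||P^T P - I||_F^2."""
    G = G.detach()
    P = P.detach().requires_grad_(True)
    residual = G - P @ (P.T @ G)                              # part of G outside the subspace
    ortho = P.T @ P - torch.eye(P.shape[1], device=P.device, dtype=P.dtype)
    loss = residual.pow(2).sum() + lam * ortho.pow(2).sum()
    loss.backward()
    with torch.no_grad():
        return P - eta_p * P.grad                             # plain GD here; any optimizer could be used

def subspace_adam_step(W, G, P, M, V, t, lr=1e-3, alpha=5.0,
                       betas=(0.9, 0.999), eps=1e-8):
    """Adam-style update whose moment states M, V live in the r-dimensional subspace."""
    with torch.no_grad():
        R = P.T @ G                                           # r x n projected gradient
        M.mul_(betas[0]).add_(R, alpha=1 - betas[0])          # low-rank first moment
        V.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])   # low-rank second moment
        m_hat = M / (1 - betas[0] ** t)
        v_hat = V / (1 - betas[1] ** t)
        N = m_hat / (v_hat.sqrt() + eps)                      # normalized low-rank direction
        W.add_(P @ N, alpha=-lr * alpha)                      # map back to full space, scaled by alpha
    return W
```

In a training loop one would call `update_projection` on each step's gradient and then `subspace_adam_step` with the refreshed projection; because the projection changes by a small gradient step rather than a fresh SVD, the optimizer states stay approximately aligned with the new subspace.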
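
The hardware row quotes a comparison of SVD and online-PCA execution time on an A100 and an RTX 3090. A rough, self-contained benchmark in the same spirit is sketched below; the matrix sizes, iteration counts, and helper names are illustrative assumptions rather than the paper's measurement script, and it assumes a CUDA device is available.

```python
# Rough timing comparison of a full SVD versus one online-PCA gradient step.
# Sizes (4096 x 4096, rank 128), iteration counts, and `pca_step` are
# illustrative assumptions; requires a CUDA GPU.
import time
import torch

def pca_step(P, G, eta_p=0.01, lam=0.1):
    # One gradient step on ||G - P P^T G||_F^2 + lam * ||P^T P - I||_F^2.
    P = P.detach().requires_grad_(True)
    eye = torch.eye(P.shape[1], device=P.device)
    loss = (G - P @ (P.T @ G)).pow(2).sum() + lam * (P.T @ P - eye).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        return P - eta_p * P.grad

def time_ms(fn, warmup=3, iters=10):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return 1e3 * (time.perf_counter() - start) / iters

m, n, r = 4096, 4096, 128                 # roughly one transformer weight matrix
G = torch.randn(m, n, device="cuda")
P = torch.randn(m, r, device="cuda")
print(f"full SVD        : {time_ms(lambda: torch.linalg.svd(G, full_matrices=False)):.1f} ms")
print(f"online PCA step : {time_ms(lambda: pca_step(P, G)):.1f} ms")
```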
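
For reference, the quoted training hyperparameters can be collected into a single configuration. The dictionary below is only a convenience sketch: the field names are chosen here, the values come from the reported setup, and anything the paper does not state (optimizer internals, scheduler shape beyond the warmup fraction) is left out.

```python
# Convenience sketch of the quoted pretraining setup; field names are chosen
# here, values come from the paper's reported hyperparameters.
pretrain_config = {
    "model": "LLaMA-1B",        # paper covers 60M to 7B; Table 1 uses the 1B model
    "dataset": "C4",
    "sequence_length": 256,
    "train_steps": 10_000,
    "batch_size": 512,
    "learning_rate": 1e-3,      # recommended for 350M/1B; 60M tolerates a wider range
    "grad_clip": 1.0,
    "warmup_ratio": 0.10,       # warmup is 10% of total training steps
    "alpha": 5.0,               # update scale, fixed across all experiments
    "lambda": 0.1,              # online-PCA regularization weight
}
```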