LoQT: Low-Rank Adapters for Quantized Pretraining

Authors: Sebastian Loeschcke, Mads Toftrup, Michael Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LoQT on language model pretraining by training LLaMA-based [15] language models on the C4 dataset [16].
Researcher Affiliation | Academia | Sebastian Loeschcke (University of Copenhagen, sbl@di.ku.dk), Mads Toftrup (Aarhus University, toftrup@cs.au.dk), Michael J. Kastoryano (University of Copenhagen, mika@di.ku.dk), Serge Belongie (University of Copenhagen, s.belongie@di.ku.dk), Vésteinn Snæbjarnarson (University of Copenhagen, vesn@di.ku.dk)
Pseudocode | Yes | Figure 3: Pseudo-code for LoQT. Algorithm 1: LoQT: Low Rank Adapters for Quantized Training (an illustrative sketch follows after this table).
Open Source Code | Yes | https://github.com/sebulo/LoQT
Open Datasets | Yes | We evaluate LoQT on language model pretraining by training LLaMA-based [15] language models on the C4 dataset [16], a collection of text in English that was extracted from the Common Crawl web scrapes [16] (a loading sketch follows after this table).
Dataset Splits | Yes | Table 1: Comparison of low-rank pre-training methods for LLaMA2-style language models on the C4 dataset. The table shows validation perplexity, memory estimates, and quantization states for LoQT.
Hardware Specification | Yes | Runs were conducted on up to 4x 40GB NVIDIA A100s, 2x 80GB NVIDIA H100s, or a single 24GB NVIDIA RTX 3090.
Software Dependencies | No | The paper mentions software such as 'BF16 format', 'NF4 precision', and the 'Adam optimizer', but does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | We keep hyperparameters consistent across model sizes, with experiments conducted in BF16 format for memory efficiency. All models are trained with a maximum sequence length of 256, a total token batch size of 131K tokens, and a learning rate warmup for the first 10% of the training steps, followed by cosine annealing to 10% of the initial learning rate (see the schedule sketch after this table). Full experimental details, including the specific hyperparameters for each task, are provided in Appendix B.
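
The pseudocode entry above refers to Figure 3 and Algorithm 1 of the paper, which are not reproduced on this page. Purely as orientation, here is a rough PyTorch sketch of the pattern the algorithm name describes: a frozen quantized base weight plus trainable low-rank factors, with the low-rank update periodically merged into the base weight and re-quantized. The `fake_quantize` helper, rank, merge interval, and initialization are illustrative placeholders standing in for the paper's NF4 quantization; this is not the authors' implementation.

```python
# Sketch of the LoQT idea (illustrative only): frozen quantized base weight,
# trainable low-rank factors, periodic merge + re-quantization.
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor quantize/dequantize stand-in for the paper's NF4."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


class LoQTLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        w = torch.randn(out_features, in_features) * 0.02
        # Frozen, quantized base weight (stored dequantized here for simplicity).
        self.register_buffer("w_q", fake_quantize(w))
        # Trainable low-rank factors; only these receive gradients.
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.w_q + self.A @ self.B).t()

    @torch.no_grad()
    def merge_and_requantize(self) -> None:
        # Fold the learned low-rank update into the base weight, re-quantize,
        # and reset the factors so training continues from the merged state.
        self.w_q.copy_(fake_quantize(self.w_q + self.A @ self.B))
        self.A.normal_(std=0.01)
        self.B.zero_()


layer = LoQTLinearSketch(64, 64, rank=4)
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
for step in range(1, 201):
    x = torch.randn(8, 64)
    loss = layer(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:  # merge interval chosen arbitrarily for illustration
        layer.merge_and_requantize()
```

The memory argument behind this pattern is that gradients and optimizer state are kept only for the small low-rank factors, while the full-size weight stays frozen and quantized; see the paper and repository for the actual algorithm and initialization.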
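For the open-dataset entry, one common way to access the English split of C4 is through the Hugging Face `datasets` library, as sketched below. The dataset identifier and streaming access are assumptions about a typical setup; the authors' actual data pipeline is in their repository and is not quoted on this page.

```python
# Illustrative streaming access to English C4 via Hugging Face `datasets`.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in c4.take(2):
    print(example["text"][:200])  # each record carries a "text" field
```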
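The experiment-setup entry describes the learning-rate schedule in words. As a reading aid, here is a small standalone sketch of such a schedule (linear warmup over the first 10% of steps, then cosine annealing down to 10% of the initial rate). The base learning rate and step counts are placeholders, not values taken from the paper.

```python
# Sketch of a warmup + cosine schedule matching the description above.
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-3,
               warmup_frac: float = 0.10, final_frac: float = 0.10) -> float:
    """Linear warmup for the first 10% of steps, then cosine decay to 10% of base_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = final_frac * base_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: inspect the schedule over a hypothetical 1,000-step run.
schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
```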