LoQT: Low-Rank Adapters for Quantized Pretraining
Authors: Sebastian Loeschcke, Mads Toftrup, Michael Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LoQT on language model pretraining by training LLaMA-based [15] language models on the C4 dataset [16]. |
| Researcher Affiliation | Academia | Sebastian Loeschcke, University of Copenhagen, sbl@di.ku.dk; Mads Toftrup, Aarhus University, toftrup@cs.au.dk; Michael J. Kastoryano, University of Copenhagen, mika@di.ku.dk; Serge Belongie, University of Copenhagen, s.belongie@di.ku.dk; Vésteinn Snæbjarnarson, University of Copenhagen, vesn@di.ku.dk |
| Pseudocode | Yes | Figure 3: Pseudo-code for LoQT. Algorithm 1 LoQT: Low Rank Adapters for Quantized Training (a minimal illustrative sketch of this scheme is given after the table). |
| Open Source Code | Yes | https://github.com/sebulo/LoQT |
| Open Datasets | Yes | We evaluate LoQT on language model pretraining by training LLaMA-based [15] language models on the C4 dataset [16], a collection of text in English that was extracted from the Common Crawl web-scrapes [16]. |
| Dataset Splits | Yes | Table 1: Comparison of low-rank pre-training methods for LLaMA2-style language models on the C4 dataset. The table shows validation perplexity, memory estimates, and quantization states for LoQT. |
| Hardware Specification | Yes | Runs were conducted on up to 4x 40GB NVIDIA A100s, 2x 80GB NVIDIA H100s, or a single 24GB NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions numerical formats and methods such as the 'BF16 format', 'NF4 precision', and the 'Adam optimizer', but does not provide version numbers for these or for any other software libraries. |
| Experiment Setup | Yes | We keep hyperparameters consistent across model sizes, with experiments conducted in BF16 format for memory efficiency. All models are trained with a maximum sequence length of 256, a total token batch size of 131K tokens, and a learning rate warmup for the first 10% of the training steps, followed by cosine annealing to 10% of the initial learning rate. Full experimental details, including the specific hyperparameters for each task, are provided in Appendix B. |
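
The Pseudocode row above only quotes the caption of Algorithm 1, so as a reading aid here is a minimal, heavily simplified sketch of the general scheme the paper builds on: a frozen, low-precision base weight plus a low-rank update of which only one factor is trained, with the update periodically merged back into the base. The module name, the fake quantizer, and the `merge_into_base` method are illustrative stand-ins and not the authors' implementation (which, per the table, uses NF4 precision and BF16 training).

```python
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Stand-in for NF4: symmetric per-tensor round-to-nearest quantization.
    # It only illustrates that the base weights are kept in low precision.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


class LowRankQuantizedLinear(nn.Module):
    """Frozen (simulated) quantized base weight plus a low-rank update P @ B.

    Only B receives gradients; the base and the projection P stay fixed
    between periodic merges. All names here are illustrative.
    """

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        w = torch.randn(out_features, in_features) * 0.02
        self.register_buffer("w_q", fake_quantize(w))                      # frozen low-precision base
        self.register_buffer("P", torch.randn(out_features, rank) * 0.02)  # fixed projection factor
        self.B = nn.Parameter(torch.zeros(rank, in_features))              # trainable low-rank factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_eff = self.w_q + self.P @ self.B  # effective weight: base + low-rank update
        return x @ w_eff.T

    @torch.no_grad()
    def merge_into_base(self) -> None:
        # Periodically fold the accumulated low-rank update into the
        # re-quantized base and reset B, so training continues with a
        # fresh low-rank adapter.
        self.w_q.copy_(fake_quantize(self.w_q + self.P @ self.B))
        self.B.zero_()
```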
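
The learning-rate schedule quoted in the Experiment Setup row (warmup over the first 10% of steps, then cosine annealing to 10% of the initial rate) can be written down compactly. The sketch below assumes the warmup is linear, which the quoted text does not state explicitly; the function name and defaults are illustrative.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_frac: float = 0.1, final_frac: float = 0.1) -> float:
    """Warmup over the first `warmup_frac` of steps (assumed linear here),
    then cosine annealing down to `final_frac` of the initial rate."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = final_frac * base_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For instance, with `total_steps=10_000` and `base_lr=1e-3`, the rate ramps up over the first 1,000 steps and decays toward `1e-4` by the final step.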