Cramming: Training a Language Model on a Single GPU in One Day

Authors: Jonas Geiping, Tom Goldstein

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
Researcher Affiliation | Academia | Dep. of Computer Science, University of Maryland, College Park. Correspondence to: Jonas Geiping <jgeiping@umd.edu>.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | We provide code to reproduce all experiments at github.com/JonasGeiping/cramming.
Open Datasets | Yes | We start our investigation with a close analogue to the original raw text sources of Devlin et al. (2019), using a recent dump of the English Wikipedia (20220301.en) and English bookcorpus, noting the commentary of Tan (2019); Bandy & Vincent (2021). [...] We test several subsets of The Pile (Gao et al., 2020). [...] Another popular source of data is C4, the colossal, cleaned version of Common Crawl (Raffel et al., 2020), from which we stream the first 20×10⁶ entries. Finally, we also include the 2019 release of the OSCAR dataset (Suárez et al., 2019), denoted by oscar. (A hedged data-loading sketch follows the table.)
Dataset Splits | Yes | Downstream performance is evaluated on GLUE (Wang et al., 2018). Downstream finetuning on GLUE is limited to brief training with only the training data of the downstream task (we consider 5 epochs or less) and needs to work with hyperparameters set globally for all GLUE tasks. [...] Finally, we systematically evaluate performance on the GLUE benchmark of Wang et al. (2018), minus WNLI as in Devlin et al. (2019). We note that we only use MNLI (m) during the previous sections and do not tune hyperparameters based on the full GLUE scores. [...] Table 3 and Table 4 describe the performance of this setup on the GLUE downstream tasks (as median over 5 downstream trials). (A finetuning sketch of this protocol follows the table.)
Hardware Specification | Yes | In our implementation, we analyze a setup with a classical rtx2080ti GPU (released September 2018) and separate setups with a more modern rtxa4000 or a rtxa6000 GPU, 48GB version (released October 2020). We pair each unit with 4 CPU cores and 32GB of RAM.
Software Dependencies | No | The paper states, "We implement everything in PyTorch (Paszke et al., 2017)", but does not provide specific version numbers for PyTorch or any other software libraries or dependencies used.
Experiment Setup | Yes | We implement everything in PyTorch (Paszke et al., 2017) and to limit our gains from the software lottery (Hooker, 2021) we do not use specialized implementations, which would further bias results towards well-established components. [...] We run all experiments and ablation studies with automated mixed precision (Micikevicius et al., 2018). [...] We train with only masked language modeling on fully packed blocks of tokens with a masking rate of 25% and the original setup of Devlin et al. (2019) where 10% of all masks are filled with random words and 10% unchanged. [...] We keep Adam (Kingma & Ba, 2015) as the optimizer of choice, with weight decay of 0.01 as described in (Loshchilov & Hutter, 2017) (i.e. AdamW), β1 = 0.9, β2 = 0.98 and ε = 10⁻¹². To stabilize training at no extra cost, we include gradient clipping at 0.5. [...] We find that a simple one-cycle learning rate schedule (Smith & Topin, 2018), with a peak learning rate of 10⁻³, leads to minimal pretraining loss within our budget, with the optimum being a triangular shape (denoted triangular in Figure 2) that mimics a long warmup period with a quick decay. [...] We find that the optimal batch size in this setting is around 2048 for minimal pretraining loss, but around 8192 for maximal downstream performance, see Figure 3. We accumulate gradients and only perform an update every 85 forward/backward passes. (A training-loop sketch with these settings follows the table.)
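
The corpora quoted in the Open Datasets row can be pulled with the Hugging Face datasets library. The short sketch below shows one way to do so; the dataset identifiers and the streaming cut-off for C4 are assumptions rather than the paper's exact loader, and the actual preprocessing (filtering, deduplication, packing into fixed-length token blocks) lives in the linked repository. The Pile subsets and OSCAR can be loaded analogously.

from itertools import islice
from datasets import load_dataset  # Hugging Face datasets; identifiers below are assumptions

# English Wikipedia dump 20220301.en and BookCorpus, matching the quoted row.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# C4 (cleaned Common Crawl): stream the corpus and keep only the first 20 x 10^6 entries.
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
c4_head = islice(iter(c4_stream), 20_000_000)  # lazy iterator over the first 20e6 documents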
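
The Dataset Splits row fixes the downstream protocol: brief finetuning (5 epochs or less) on each task's training split with one hyperparameter set shared across all GLUE tasks, reported as the median over 5 trials. Below is a hedged sketch of that protocol using the Hugging Face transformers Trainer; the checkpoint path and the concrete hyperparameter values are illustrative assumptions, not the paper's choices.

import statistics
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "path/to/crammed-checkpoint"   # assumption: a locally pretrained crammed model
SHARED_HPARAMS = dict(learning_rate=4e-5, per_device_train_batch_size=32,
                      num_train_epochs=5, weight_decay=0.01)  # one global setting, all tasks

def finetune_once(task: str, seed: int) -> float:
    """Finetune on one GLUE task's training split and return validation accuracy."""
    raw = load_dataset("glue", task)  # RTE used here; it has sentence1/sentence2 columns
    tok = AutoTokenizer.from_pretrained(CHECKPOINT)
    enc = raw.map(lambda b: tok(b["sentence1"], b["sentence2"], truncation=True), batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    args = TrainingArguments(output_dir=f"runs/{task}-{seed}", seed=seed, **SHARED_HPARAMS)
    trainer = Trainer(model=model, args=args, tokenizer=tok,
                      train_dataset=enc["train"], eval_dataset=enc["validation"],
                      compute_metrics=lambda p: {"acc": float(
                          (np.argmax(p.predictions, axis=-1) == p.label_ids).mean())})
    trainer.train()
    return trainer.evaluate()["eval_acc"]

# Median over 5 downstream trials, mirroring the quoted evaluation.
print("RTE median accuracy:", statistics.median(finetune_once("rte", seed) for seed in range(5)))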
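
The Experiment Setup row pins down most of the optimization recipe: AdamW with β1 = 0.9, β2 = 0.98, ε = 10⁻¹² and weight decay 0.01, gradient clipping at 0.5, a triangular one-cycle schedule peaking at 10⁻³, automated mixed precision, a 25% masking rate, and one optimizer update per 85 accumulated forward/backward passes. The sketch below wires those numbers into a plain PyTorch loop; the placeholder model, total step count, micro-batch size, and the simplified masking (omitting the 10% random / 10% unchanged rule) are assumptions, not the paper's implementation.

import torch
from transformers import BertConfig, BertForMaskedLM

ACCUMULATION = 85          # one optimizer update every 85 forward/backward passes (quoted above)
TOTAL_UPDATES = 10_000     # assumption: in practice this is set by the one-day budget
MICRO_BS, SEQ_LEN, VOCAB = 96, 128, 30522   # assumptions; 85 * 96 approximates the 8192 batch

def packed_mlm_batches(num_updates):
    """Toy stand-in for packed token blocks with a 25% masking rate (simplified MLM)."""
    for _ in range(num_updates * ACCUMULATION):
        ids = torch.randint(0, VOCAB, (MICRO_BS, SEQ_LEN))
        labels = ids.clone()
        mask = torch.rand(ids.shape) < 0.25
        labels[~mask] = -100        # compute loss only on masked positions
        ids[mask] = 103             # BERT [MASK] id; omits the 10% random / 10% unchanged rule
        yield {"input_ids": ids, "labels": labels}

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForMaskedLM(BertConfig()).to(device)   # placeholder encoder, not the crammed model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.98), eps=1e-12, weight_decay=0.01)
# Triangular one-cycle schedule with a 1e-3 peak; the warmup fraction is an assumption.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=TOTAL_UPDATES,
    pct_start=0.5, anneal_strategy="linear", cycle_momentum=False)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # automated mixed precision

for step, batch in enumerate(packed_mlm_batches(TOTAL_UPDATES)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = model(**batch).loss / ACCUMULATION
    scaler.scale(loss).backward()
    if (step + 1) % ACCUMULATION == 0:                            # gradient accumulation
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)   # gradient clipping at 0.5
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()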