Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Authors: Atli Kosson, Bettina Messmer, Martin Jaggi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]. |
| Researcher Affiliation | Academia | Atli Kosson, Bettina Messmer, Martin Jaggi; EPFL, Switzerland; firstname.lastname@epfl.ch |
| Pseudocode | Yes | Algorithm 1 AdamW (PyTorch variant, differing from the original by Loshchilov and Hutter [24]). A sketch of this update rule is given below the table. |
| Open Source Code | No | The datasets we use are freely available online but we have not released our code. |
| Open Datasets | Yes | Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]. [8] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. |
| Dataset Splits | No | The paper mentions 'Validation Loss' in its figures, indicating the use of a validation set, but it does not provide explicit details on the dataset splits (e.g., percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | Yes | Our experiments are performed on A100 GPUs with either 40GB or 80GB of RAM. |
| Software Dependencies | No | The paper mentions a 'PyTorch variant' of AdamW (Algorithm 1) and that 'Our code is based on nanoGPT', but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Our base training is performed at batch size 480 with a sequence length of 1024. We train for 5000 iterations... The baselines use AdamW [24] (see algo. 1) with weight decay λ = 0.1, momentum coefficient β1 = 0.9, smoothing coefficient β2 = 0.95, and ε = 10⁻⁸. The learning rate schedule consists of a linear warmup followed by a constant phase and eventually a linear cooldown spanning half of training. A configuration sketch matching these values follows the table. |
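
The paper's Algorithm 1 is the PyTorch-style AdamW update with decoupled weight decay. As a point of reference, here is our own minimal sketch of that update rule (not the paper's listing); it operates on a single scalar parameter for clarity:

```python
import math

def adamw_step(p, grad, m, v, t, lr, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One PyTorch-style AdamW step for a single scalar parameter p (illustrative sketch).

    Weight decay is decoupled: p is shrunk by lr * wd directly, instead of
    adding wd * p to the gradient as in plain Adam with L2 regularization.
    """
    p = p * (1 - lr * wd)                      # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction, t counted from 1
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```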
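
To make the reported setup concrete, the sketch below configures an optimizer and learning-rate schedule matching the quoted hyperparameters (β1 = 0.9, β2 = 0.95, ε = 10⁻⁸, weight decay 0.1, 5000 iterations, linear warmup, constant phase, linear cooldown over the last half of training). The warmup length and base learning rate are placeholders, since the paper varies them across experiments, and the model is a stand-in rather than the 124M-parameter GPT-2:

```python
import torch

# Reported training length; warmup_iters and base_lr are assumed values for
# illustration, as the paper sweeps both across experiments.
max_iters = 5000
warmup_iters = 500
base_lr = 6e-4

def lr_at(it):
    """Linear warmup -> constant -> linear cooldown over the last half of training."""
    cooldown_start = max_iters // 2
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    if it < cooldown_start:
        return base_lr
    return base_lr * (max_iters - it) / (max_iters - cooldown_start)

model = torch.nn.Linear(8, 8)  # stand-in for the 124M-parameter GPT-2 model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=base_lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: lr_at(it) / base_lr
)
```

In a training loop, each iteration would call `optimizer.step()` followed by `scheduler.step()` so that the learning rate follows the warmup/constant/cooldown shape described in the paper.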