Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Authors: Atli Kosson, Bettina Messmer, Martin Jaggi
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]. |
| Researcher Affiliation | Academia | Atli Kosson Bettina Messmer Martin Jaggi EPFL, Switzerland EMAIL |
| Pseudocode | Yes | Algorithm 1 AdamW (PyTorch variant, differing from the original by Loshchilov and Hutter [24]) |
| Open Source Code | No | The datasets we use are freely available online but we have not released our code. |
| Open Datasets | Yes | Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]. [8] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. |
| Dataset Splits | No | The paper mentions 'Validation Loss' in its figures, indicating the use of a validation set, but it does not provide explicit details on the dataset splits (e.g., percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | Yes | Our experiments are performed on A100 GPUs with either 40GB or 80GB of RAM. |
| Software Dependencies | No | The paper mentions 'PyTorch variant' for AdamW (Algorithm 1) and that 'Our code is based on NanoGPT', but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Our base training is performed at batch size 480 with a sequence length of 1024. We train for 5000 iterations... The baselines use AdamW [24] (see algo. 1) with weight decay λ = 0.1, momentum coefficient β1 = 0.9, smoothing coefficient β2 = 0.95, and ε = 10⁻⁸. The learning rate schedule consists of a linear warmup followed by a constant phase and eventually linear cooldown spanning half of training. |
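
The setup quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code, which has not been released. The iteration count, batch size, sequence length, and the cooldown covering the second half of training come from the quoted text. The names `lr_at`, `warmup_steps`, and `peak_lr`, and their default values, are hypothetical placeholders, since the excerpt gives neither the warmup length nor the peak learning rate.

```python
# Values taken from the quoted setup.
TOTAL_STEPS = 5000   # training iterations
BATCH_SIZE = 480     # sequences per batch
SEQ_LEN = 1024       # tokens per sequence

# Keyword names follow torch.optim.AdamW; eps=1e-8 is also its default.
ADAMW_KWARGS = dict(weight_decay=0.1, betas=(0.9, 0.95), eps=1e-8)


def lr_at(step, total_steps=TOTAL_STEPS, warmup_steps=250, peak_lr=1e-3):
    """Linear warmup, constant phase, then linear cooldown over the last half.

    warmup_steps and peak_lr are hypothetical; the excerpt does not state them.
    Assumes warmup_steps <= total_steps // 2.
    """
    cooldown_start = total_steps // 2
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        return peak_lr
    # Decay linearly towards zero at the end of training.
    return peak_lr * (total_steps - step) / (total_steps - cooldown_start)


tokens_per_step = BATCH_SIZE * SEQ_LEN            # 491,520 tokens
total_tokens = tokens_per_step * TOTAL_STEPS      # about 2.46 billion tokens
```

Under these assumptions, `ADAMW_KWARGS` could be passed to `torch.optim.AdamW` together with the model parameters, and `lr_at(step)` could set the learning rate before each optimizer step.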