Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]."
Researcher Affiliation | Academia | "Atli Kosson, Bettina Messmer, Martin Jaggi, EPFL, Switzerland, firstname.lastname@epfl.ch"
Pseudocode | Yes | "Algorithm 1 AdamW (PyTorch variant, differing from the original by Loshchilov and Hutter [24])" (see the AdamW sketch after the table)
Open Source Code | No | "The datasets we use are freely available online but we have not released our code."
Open Datasets | Yes | "Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]." [8] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
Dataset Splits | No | The paper mentions "Validation Loss" in its figures, indicating the use of a validation set, but it does not provide explicit details on the dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | Yes | "Our experiments are performed on A100 GPUs with either 40GB or 80GB of RAM."
Software Dependencies | No | The paper notes that AdamW (Algorithm 1) is the "PyTorch variant" and that "Our code is based on nanoGPT", but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | "Our base training is performed at batch size 480 with a sequence length of 1024. We train for 5000 iterations... The baselines use AdamW [24] (see algo. 1) with weight decay λ = 0.1, momentum coefficient β1 = 0.9, smoothing coefficient β2 = 0.95, and ε = 10⁻⁸. The learning rate schedule consists of a linear warmup followed by a constant phase and eventually linear cooldown spanning half of training." (see the schedule sketch after the table)
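The Pseudocode row refers to the paper's Algorithm 1, the PyTorch-style AdamW update in which the decoupled weight decay is scaled by the current learning rate (unlike the original Loshchilov and Hutter formulation, which scales it by the schedule multiplier only). The following is a minimal illustrative sketch of that update using the hyperparameters reported in the paper (β1 = 0.9, β2 = 0.95, ε = 10⁻⁸, λ = 0.1); the function name and tensor-level details are assumptions, not the authors' code.

    import torch

    def adamw_step(param, grad, exp_avg, exp_avg_sq, step, lr,
                   beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
        # Decoupled weight decay, scaled by the learning rate as in PyTorch's AdamW
        # (the original Loshchilov & Hutter variant does not scale the decay by lr).
        param.mul_(1 - lr * weight_decay)
        # Standard Adam first/second moment updates.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        # Bias-corrected parameter update (step is 1-indexed).
        bias1 = 1 - beta1 ** step
        bias2 = 1 - beta2 ** step
        denom = (exp_avg_sq / bias2).sqrt().add_(eps)
        param.addcdiv_(exp_avg / bias1, denom, value=-lr)

A training loop would call this once per parameter tensor per iteration, starting with step = 1 and zero-initialized moment buffers.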
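Similarly, the learning rate schedule quoted in the Experiment Setup row (linear warmup, then a constant phase, then a linear cooldown spanning the second half of the 5000 training iterations) can be expressed as a small function. This is a sketch under assumptions: the peak learning rate and warmup length below are hypothetical placeholders, as the excerpt does not state them.

    def lr_schedule(step, total_steps=5000, peak_lr=6e-4, warmup_steps=250):
        # total_steps matches the paper's 5000 iterations; peak_lr and warmup_steps
        # are illustrative placeholders, not values taken from the paper.
        cooldown_start = total_steps // 2  # cooldown spans the last half of training
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps  # linear warmup
        if step < cooldown_start:
            return peak_lr  # constant phase
        # linear cooldown toward zero over the second half
        return peak_lr * (total_steps - step) / (total_steps - cooldown_start)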