Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]."
Researcher Affiliation | Academia | "Atli Kosson, Bettina Messmer, Martin Jaggi, EPFL, Switzerland, firstname.lastname@epfl.ch"
Pseudocode | Yes | "Algorithm 1 AdamW (PyTorch variant, differing from the original by Loshchilov and Hutter [24])" (see the AdamW sketch after the table)
Open Source Code | No | "The datasets we use are freely available online but we have not released our code."
Open Datasets | Yes | "Our main experiments focus on the training of a 124M parameter GPT2 [29] model on the OpenWebText corpus [8]." [8] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
Dataset Splits | No | The paper mentions "Validation Loss" in its figures, indicating the use of a validation set, but it does not provide explicit details on the dataset splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | Yes | "Our experiments are performed on A100 GPUs with either 40GB or 80GB of RAM."
Software Dependencies | No | The paper notes that AdamW (Algorithm 1) is the "PyTorch variant" and that "Our code is based on nanoGPT", but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | "Our base training is performed at batch size 480 with a sequence length of 1024. We train for 5000 iterations... The baselines use AdamW [24] (see algo. 1) with weight decay λ = 0.1, momentum coefficient β1 = 0.9, smoothing coefficient β2 = 0.95, and ε = 10⁻⁸. The learning rate schedule consists of a linear warmup followed by a constant phase and eventually linear cooldown spanning half of training." (see the schedule sketch after the table)
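The Pseudocode row refers to the paper's Algorithm 1, the PyTorch-style AdamW update in which the decoupled weight decay is scaled by the current learning rate (unlike the original Loshchilov and Hutter formulation, which scales it by the schedule multiplier only). The following is a minimal illustrative sketch of that update using the hyperparameters reported in the paper (β1 = 0.9, β2 = 0.95, ε = 10⁻⁸, λ = 0.1); the function name and tensor-level details are assumptions, not the authors' code.

    import torch

    def adamw_step(param, grad, exp_avg, exp_avg_sq, step, lr,
                   beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
        # Decoupled weight decay, scaled by the learning rate as in PyTorch's AdamW
        # (the original Loshchilov & Hutter variant does not scale the decay by lr).
        param.mul_(1 - lr * weight_decay)
        # Standard Adam first/second moment updates.
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        # Bias-corrected parameter update (step is 1-indexed).
        bias1 = 1 - beta1 ** step
        bias2 = 1 - beta2 ** step
        denom = (exp_avg_sq / bias2).sqrt().add_(eps)
        param.addcdiv_(exp_avg / bias1, denom, value=-lr)

A training loop would call this once per parameter tensor per iteration, starting with step = 1 and zero-initialized moment buffers.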
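Similarly, the learning rate schedule quoted in the Experiment Setup row (linear warmup, then a constant phase, then a linear cooldown spanning the second half of the 5000 training iterations) can be expressed as a small function. This is a sketch under assumptions: the peak learning rate and warmup length below are hypothetical placeholders, as the excerpt does not state them.

    def lr_schedule(step, total_steps=5000, peak_lr=6e-4, warmup_steps=250):
        # total_steps matches the paper's 5000 iterations; peak_lr and warmup_steps
        # are illustrative placeholders, not values taken from the paper.
        cooldown_start = total_steps // 2  # cooldown spans the last half of training
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps  # linear warmup
        if step < cooldown_start:
            return peak_lr  # constant phase
        # linear cooldown toward zero over the second half
        return peak_lr * (total_steps - step) / (total_steps - cooldown_start)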