The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
Authors: Conglong Li, Minjia Zhang, Yuxiong He
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an in-depth analysis on large-scale pre-training experiments replicating the GPT-2 model with a public dataset. We find that there is a strong correlation between training instability and extreme values of gradient variance. We further identify that samples with long sequence lengths contribute to these extreme gradient variance values, especially at the beginning of training, indicating that long sequence length can be a main source of training instability. Based on the analysis, we present a simple yet effective Sequence Length Warmup method that aims to solve the training stability-efficiency dilemma by avoiding extreme gradient variance values. Moreover, we present a lightweight tuning strategy that allows us to tune our method with just a small portion of the expensive full training. Experiments replicating GPT-2 models (117M and 1.5B) show that our approach enables stable training with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training instability. To achieve the same or better zero-shot evaluation results, our method reduces the required number of training tokens and wall clock time by up to 2.2x and 3.7x, respectively. Experiments replicating a GPT-3 model (125M) show that our approach enables stable training with 8x larger batch size and 40x larger learning rate, and retains 99% of the zero-shot accuracy on 11 tasks using 10x less data and 17x less time compared to the original GPT-3 training recipe, while the baseline diverges under the same settings and only retains 95% of accuracy under a lower learning rate. (A minimal sketch of such a warmup schedule appears after this table.) |
| Researcher Affiliation | Industry | Conglong Li Microsoft conglong.li@microsoft.com Minjia Zhang Microsoft minjiaz@microsoft.com Yuxiong He Microsoft yuxhe@microsoft.com |
| Pseudocode | No | The paper describes the Sequence Length Warmup method verbally but does not provide pseudocode or a formal algorithm block. |
| Open Source Code | Yes | The implementation of our approach as well as the necessary changes to the GPT-2/3 pre-training framework has been open sourced in a deep learning optimization library called DeepSpeed (https://github.com/microsoft/DeepSpeed, https://www.deepspeed.ai/). |
| Open Datasets | Yes | For training data, we collect and use the same dataset blend as the Megatron-LM work: Wikipedia [11], CC-Stories [45], RealNews [54], and OpenWebText [32]. |
| Dataset Splits | No | The paper mentions using a "validation set" for perplexity analysis, but does not specify the explicit split percentages or counts for how the overall training data was partitioned into training and validation sets for reproducibility. |
| Hardware Specification | Yes | All of the experiments are performed on 128 NVIDIA V100 GPUs (32GB memory). There are 16 nodes and 8 GPUs per node. GPUs inside the same node are connected by NVLink 2.0, and nodes are connected by a 100 Gigabit InfiniBand EDR inter-node network. |
| Software Dependencies | No | The paper mentions the use of 'Adam optimizer' and 'mixed precision/FP16 training' but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions) needed for reproduction. |
| Experiment Setup | Yes | The first set follows the Megatron-LM work: batch size 512, 300K total training steps (157B tokens), and learning rate 1.5 × 10⁻⁴ with a linear warmup of 3K steps and a single-cycle cosine decay over the remaining 297K steps (1 × 10⁻⁵ min. learning rate). The second parameter set tests a more aggressive training strategy: batch size 4K (8× larger), 37.5K total training steps (157B tokens), and learning rate 6 × 10⁻⁴ (4× larger) with a linear warmup of 3K steps and a single-cycle cosine decay over the remaining 34.5K steps (same min. learning rate). For sequence length/context size, we mainly use 1K, which is the default for GPT-2. But we also test 2K (on the 117M model with batch size 512 and 157B tokens), which is the default for GPT-3. All experiments are performed with mixed-precision/FP16 training, the Adam optimizer (β₁ = 0.9, β₂ = 0.999, ε = 1 × 10⁻⁸) [17], 0.01 weight decay, the same random seed, and gradient clipping at 1.0. (A sketch of this learning-rate schedule follows the table.) |
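The Sequence Length Warmup method summarized in the abstract above trains on short sequences first and gradually increases the sequence length to the full context size. Below is a minimal sketch of such a schedule, assuming a simple linear warmup; the function name and parameters are illustrative and are not taken from the paper or the DeepSpeed API.

```python
# Illustrative sequence length warmup schedule (assumed names, not the paper's code).
# Idea: start training with short sequences and grow linearly to the full context
# size, avoiding the extreme gradient variance that long sequences cause early on.

def warmup_seq_len(step, warmup_steps, min_seq_len=64, max_seq_len=1024, multiple=8):
    """Sequence length to use at a given training step."""
    if step >= warmup_steps:
        return max_seq_len
    # Linear interpolation between the starting and the full sequence length.
    seq_len = min_seq_len + (max_seq_len - min_seq_len) * step / warmup_steps
    # Round down to a hardware-friendly multiple, never below the minimum.
    return max(min_seq_len, int(seq_len) // multiple * multiple)

# Each training batch would then be truncated (or re-packed) to
# warmup_seq_len(step, warmup_steps) tokens before the forward pass.
```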
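For reference, the learning-rate schedule quoted in the experiment setup (linear warmup followed by a single-cycle cosine decay) can be sketched as follows. The function name is an assumption, and the default values correspond to the first (Megatron-LM-style) parameter set: peak 1.5 × 10⁻⁴, minimum 1 × 10⁻⁵, 3K warmup steps, 300K total steps.

```python
import math

def gpt2_lr(step, peak_lr=1.5e-4, min_lr=1e-5, warmup_steps=3_000, total_steps=300_000):
    """Learning rate at a given step: linear warmup, then single-cycle cosine decay."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate over the first 3K steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```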