Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Authors: Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a multitude of training runs, we demonstrate how a simple alternative of performing a cooldown after a constant learning rate, which was already suggested in the literature (Zhai et al., 2022) and recently used by released models (Hu et al., 2024; Shen et al., 2024), matches the performance of cosine. We expand on this and provide and analyze different recipes for the decay form and length, which scale as reliably as cosine, outperforming it for sufficiently long cooldowns. (A minimal schedule sketch follows the table.) |
| Researcher Affiliation | Collaboration | Alexander Hägele¹, Elie Bakouch², Atli Kosson¹, Loubna Ben Allal², Leandro Von Werra², Martin Jaggi¹ (¹EPFL, ²Hugging Face); alexander.hagele@epfl.ch |
| Pseudocode | Yes | We provide our Python code in Figure 14. |
| Open Source Code | Yes | Our code is available at https://github.com/epfml/schedules-and-scaling/. |
| Open Datasets | Yes | We train on a subset of SlimPajama (Soboleva et al., 2023) with 6B tokens (https://huggingface.co/datasets/DKYoon/SlimPajama-6B), a cleaned and deduplicated corpus for LLM pretraining, which we split into train and validation sequences and report validation loss (perplexity). We provide all the details in Appendix A.1. (A dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper states: 'We use a subset of the full corpus that comprises roughly 6B tokens and randomly sample a validation set of roughly 3M tokens.' For the 1B and 8B models, it describes evaluation on common LLM benchmarks. While train and validation splits of SlimPajama are mentioned, a separate 'test' split derived from SlimPajama itself is not explicitly defined in terms of size or percentage; testing is instead done on external benchmarks. |
| Hardware Specification | Yes | All experiments (aside from 1B and 8B, see A.2) were performed using a cluster of A100 GPUs (both 40GB/80GB RAM) with 2 data-parallel (i.e. 2 GPUs per run). Some selected runs used a single node of 8 H100s. ... Each run for the 1B model was performed on 4x H100 GPUs. For the 8B model, ... here, we use 12 nodes, each composed of 4x GH200 GPUs. |
| Software Dependencies | No | The paper mentions key software components like 'PyTorch', 'FlashAttention', and the 'nanotron library'. However, it does not provide specific version numbers for these software dependencies, which are required for reproducibility (e.g., 'PyTorch 1.9' or 'FlashAttention v2.1'). |
| Experiment Setup | Yes | Throughout this paper, we use the AdamW optimizer with weight decay (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) with common LLM training parameters. ... we follow standard practices in LLM training and use the AdamW optimizer with beta parameters (β1, β2) = (0.9, 0.95), decoupled weight decay of 0.1 ... and gradient clipping with 1.0. For warmup steps, we use a short warmup of 300 steps for the majority of runs and 1000-3000 for longer runs (above 100k total steps). The cosine schedule decays the learning rate to 10% of the maximum learning rate. For most of our experiments, we use a batch size of 200, i.e., roughly 0.1M tokens for a sequence length of 512. ... We provide an overview of the model sizes and configurations in Table 1 and the parameters for training and length in Table 2. All models for the scaling law experiments are trained with 300 warmup steps and a sequence length of 512. (An optimizer-setup sketch follows the table.) |
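
The Research Type row describes replacing the cosine schedule with a constant learning rate followed by a cooldown. The sketch below contrasts the two schedule shapes as plain Python functions; the function names, the linear cooldown form, and the 20% cooldown fraction are illustrative assumptions, not the authors' exact implementation (which the paper provides in Figure 14 and in the linked repository).

```python
import math


def cosine_schedule(step, total_steps, warmup_steps, max_lr, final_ratio=0.1):
    """Linear warmup, then cosine decay to final_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = final_ratio * max_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def constant_with_cooldown(step, total_steps, warmup_steps, max_lr, cooldown_frac=0.2):
    """Linear warmup, constant learning rate, then a linear cooldown to zero
    over the final cooldown_frac of training."""
    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        return max_lr
    return max_lr * (total_steps - step) / max(1, cooldown_steps)
```

Because the constant phase does not need to know the total step budget in advance, a single run can be cooled down at several different lengths, which is the flexibility the title's "beyond fixed training durations" points to.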
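The Open Datasets row references the SlimPajama-6B subset hosted on the Hugging Face Hub. Below is a minimal loading sketch using the `datasets` library; the split name and the random 0.1% hold-out are illustrative assumptions and do not reproduce the authors' roughly 3M-token validation sample.

```python
from datasets import load_dataset

# Load the ~6B-token SlimPajama subset referenced in the paper.
dataset = load_dataset("DKYoon/SlimPajama-6B", split="train")

# Illustrative random hold-out; the paper samples roughly 3M validation tokens,
# whereas this simply holds out 0.1% of the documents.
splits = dataset.train_test_split(test_size=0.001, seed=42)
train_ds, val_ds = splits["train"], splits["test"]

print(len(train_ds), len(val_ds))
```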
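The Experiment Setup row lists concrete AdamW hyperparameters. The PyTorch snippet below is a minimal sketch of that optimizer configuration with gradient clipping; the model, batch, loss, and the peak learning rate of 1e-3 are placeholders, not the paper's architecture or tuned values.

```python
import torch

# Placeholder model and batch; the paper trains transformer LMs at a 512-token context.
model = torch.nn.Linear(512, 512)
batch = torch.randn(8, 512)

# AdamW with the hyperparameters quoted above (betas and decoupled weight decay);
# the peak learning rate is an illustrative placeholder.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# One schematic training step with gradient clipping at a max norm of 1.0.
loss = model(batch).pow(2).mean()  # dummy loss for illustration
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```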