Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

Authors: Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train Vision Transformers (Dosovitskiy et al., 2020) and Language Models (Radford et al., 2019) and showcase how an adaptive training scheme can lead to substantial training FLOPs reduction, in some cases more than 50%.
Researcher Affiliation | Academia | (1) Department of Computer Science, ETH Zurich; (2) ETH AI Center.
Pseudocode | No | The paper describes its methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository for the described methodology.
Open Datasets | Yes | To explore this trade-off, we pre-train Vision Transformers of different sizes (see Fig. 2(b) for a summary) on the public ImageNet-21k dataset (Ridnik et al., 2021), employing various patch sizes that remain fixed throughout training.
Dataset Splits | No | For the ViT experiments, the paper mentions measuring the '10-shot error on the ImageNet-1k dataset' but does not specify the training, validation, or test splits. For the LLMs, it states 'We evaluate on a held-out validation set' without providing split percentages or sample counts for reproducibility (see the 10-shot sketch below the table).
Hardware Specification | No | The paper mentions 'GPU-hours' and a power draw of '280W', but it does not specify exact GPU models (e.g., NVIDIA A100, Tesla V100) or other hardware details such as CPU models or memory capacity.
Software Dependencies | No | The paper mentions and cites scipy for function minimization, but it does not provide a version number for scipy or any other software dependency (a generic example of such a fit is sketched below the table).
Experiment Setup | Yes | In Tab. 1 we showcase hyper-parameters used when training on ImageNet-21k. We optimized each of the parameters for the different model classes by training for different configurations for a fixed, small amount of compute, namely 4 × 10^17 FLOPs. Some examples of such hyper-parameter search are illustrated in Fig. 13. All experiments were conducted using bfloat16. (A rough compute-budget calculation is sketched below the table.)
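The Research Type and Experiment Setup rows refer to training-FLOPs reductions from varying the patch size and to a fixed tuning budget of 4 × 10^17 FLOPs. As a rough illustration of how those quantities relate, the sketch below estimates per-image training cost with the common ~6 FLOPs-per-parameter-per-token rule of thumb and converts the budget into a number of optimizer steps. The approximation itself, the 86M-parameter model, the 224-pixel resolution, and the batch size of 1024 are all assumptions for illustration, not values taken from the paper.

```python
def vit_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image, plus one class token."""
    return (image_size // patch_size) ** 2 + 1

def approx_train_flops_per_image(num_params: float, image_size: int, patch_size: int) -> float:
    """Rough training cost: ~6 FLOPs per parameter per token (forward + backward).
    Ignores the quadratic attention term and patch-embedding details."""
    return 6.0 * num_params * vit_tokens(image_size, patch_size)

BUDGET = 4e17       # the fixed hyper-parameter-tuning budget quoted above
NUM_PARAMS = 86e6   # hypothetical ViT-B-sized model
IMAGE_SIZE = 224    # hypothetical resolution
BATCH_SIZE = 1024   # hypothetical batch size

for patch in (32, 16, 8):
    per_image = approx_train_flops_per_image(NUM_PARAMS, IMAGE_SIZE, patch)
    steps = BUDGET / (per_image * BATCH_SIZE)
    print(f"patch {patch:2d}: {vit_tokens(IMAGE_SIZE, patch):5d} tokens, "
          f"~{per_image:.2e} FLOPs/image, ~{steps:,.0f} steps in the 4e17 budget")
```

Halving the patch size roughly quadruples the token count and hence the per-image cost, which is the lever an adaptive patch-size schedule exploits; the paper's own FLOPs accounting may of course differ from this rule of thumb.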
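The Software Dependencies row notes that scipy is used for function minimization but gives no version, and the paper's fitting code is not reproduced here. The following is only a generic sketch of fitting a saturating power law to (compute, loss) measurements with scipy.optimize.minimize; the functional form, the synthetic data points, and the squared-error objective are assumptions, not the paper's actual fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic (compute, loss) points purely for illustration; the paper's
# actual measurements are not reproduced here.
compute = np.array([1e17, 4e17, 1e18, 4e18, 1e19])
loss = np.array([4.1, 3.6, 3.3, 3.0, 2.8])

# Normalise compute so the fitted parameters stay well-scaled.
c_norm = compute / compute.min()

def predicted_loss(params, c):
    a, b, e = params
    return a * c ** (-b) + e  # saturating power law with irreducible term e

def objective(params):
    residuals = predicted_loss(params, c_norm) - loss
    return float(np.mean(residuals ** 2))

result = minimize(objective, x0=np.array([1.0, 0.3, 2.0]), method="Nelder-Mead")
a, b, e = result.x
print(f"fitted: L(C) ≈ {a:.3g} * (C / {compute.min():.0e})^(-{b:.3g}) + {e:.3g}")
```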
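The Dataset Splits row quotes a '10-shot error on the ImageNet-1k dataset' without pinning down the protocol. One common reading, probing frozen features with a linear classifier trained on ten examples per class, is sketched below on random stand-in features; the use of scikit-learn's LogisticRegression, the 10-class toy setup, and every number here are assumptions rather than the paper's actual evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen features from a pre-trained ViT; ImageNet-1k has 1000
# classes, but a 10-class toy problem keeps this sketch fast to run.
num_classes, feature_dim, shots = 10, 768, 10
train_x = rng.normal(size=(num_classes * shots, feature_dim))
train_y = np.repeat(np.arange(num_classes), shots)
test_x = rng.normal(size=(500, feature_dim))
test_y = rng.integers(0, num_classes, size=500)

# Fit a linear probe on the few-shot examples and report the error rate.
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
error = 1.0 - probe.score(test_x, test_y)
print(f"{shots}-shot error: {error:.3f}")
```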