Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

Authors: Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train Vision Transformers (Dosovitskiy et al., 2020) and Language Models (Radford et al., 2019) and showcase how an adaptive training scheme can lead to substantial training FLOPs reduction, in some cases more than 50%.
Researcher Affiliation | Academia | (1) Department of Computer Science, ETH Zurich; (2) ETH AI Center.
Pseudocode | No | The paper describes its methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code or a link to a code repository for the described methodology.
Open Datasets | Yes | To explore this trade-off, we pre-train Vision Transformers of different sizes (see Fig. 2(b) for a summary) on the public ImageNet-21k dataset (Ridnik et al., 2021), employing various patch sizes that remain fixed throughout training.
Dataset Splits | No | For the ViT experiments, the paper mentions measuring the '10-shot error on the ImageNet-1k dataset' but does not specify the training, validation, or test splits. For the LLMs, it states 'We evaluate on a held-out validation set' without providing split percentages or sample counts for reproducibility (see the 10-shot sketch below the table).
Hardware Specification | No | The paper mentions 'GPU-hours' and a power draw of '280W', but it does not specify exact GPU models (e.g., NVIDIA A100, Tesla V100) or other hardware details such as CPU models or memory capacity.
Software Dependencies | No | The paper mentions and cites scipy for function minimization, but it does not provide a version number for scipy or any other software dependency (a generic example of such a fit is sketched below the table).
Experiment Setup | Yes | In Tab. 1 we showcase hyper-parameters used when training on ImageNet-21k. We optimized each of the parameters for the different model classes by training for different configurations for a fixed, small amount of compute, namely 4 × 10^17 FLOPs. Some examples of such hyper-parameter search are illustrated in Fig. 13. All experiments were conducted using bfloat16. (A rough compute-budget calculation is sketched below the table.)
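The Research Type and Experiment Setup rows refer to training-FLOPs reductions from varying the patch size and to a fixed tuning budget of 4 × 10^17 FLOPs. As a rough illustration of how those quantities relate, the sketch below estimates per-image training cost with the common ~6 FLOPs-per-parameter-per-token rule of thumb and converts the budget into a number of optimizer steps. The approximation itself, the 86M-parameter model, the 224-pixel resolution, and the batch size of 1024 are all assumptions for illustration, not values taken from the paper.

```python
def vit_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image, plus one class token."""
    return (image_size // patch_size) ** 2 + 1

def approx_train_flops_per_image(num_params: float, image_size: int, patch_size: int) -> float:
    """Rough training cost: ~6 FLOPs per parameter per token (forward + backward).
    Ignores the quadratic attention term and patch-embedding details."""
    return 6.0 * num_params * vit_tokens(image_size, patch_size)

BUDGET = 4e17       # the fixed hyper-parameter-tuning budget quoted above
NUM_PARAMS = 86e6   # hypothetical ViT-B-sized model
IMAGE_SIZE = 224    # hypothetical resolution
BATCH_SIZE = 1024   # hypothetical batch size

for patch in (32, 16, 8):
    per_image = approx_train_flops_per_image(NUM_PARAMS, IMAGE_SIZE, patch)
    steps = BUDGET / (per_image * BATCH_SIZE)
    print(f"patch {patch:2d}: {vit_tokens(IMAGE_SIZE, patch):5d} tokens, "
          f"~{per_image:.2e} FLOPs/image, ~{steps:,.0f} steps in the 4e17 budget")
```

Halving the patch size roughly quadruples the token count and hence the per-image cost, which is the lever an adaptive patch-size schedule exploits; the paper's own FLOPs accounting may of course differ from this rule of thumb.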
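The Software Dependencies row notes that scipy is used for function minimization but gives no version, and the paper's fitting code is not reproduced here. The following is only a generic sketch of fitting a saturating power law to (compute, loss) measurements with scipy.optimize.minimize; the functional form, the synthetic data points, and the squared-error objective are assumptions, not the paper's actual fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic (compute, loss) points purely for illustration; the paper's
# actual measurements are not reproduced here.
compute = np.array([1e17, 4e17, 1e18, 4e18, 1e19])
loss = np.array([4.1, 3.6, 3.3, 3.0, 2.8])

# Normalise compute so the fitted parameters stay well-scaled.
c_norm = compute / compute.min()

def predicted_loss(params, c):
    a, b, e = params
    return a * c ** (-b) + e  # saturating power law with irreducible term e

def objective(params):
    residuals = predicted_loss(params, c_norm) - loss
    return float(np.mean(residuals ** 2))

result = minimize(objective, x0=np.array([1.0, 0.3, 2.0]), method="Nelder-Mead")
a, b, e = result.x
print(f"fitted: L(C) ≈ {a:.3g} * (C / {compute.min():.0e})^(-{b:.3g}) + {e:.3g}")
```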
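The Dataset Splits row quotes a '10-shot error on the ImageNet-1k dataset' without pinning down the protocol. One common reading, probing frozen features with a linear classifier trained on ten examples per class, is sketched below on random stand-in features; the use of scikit-learn's LogisticRegression, the 10-class toy setup, and every number here are assumptions rather than the paper's actual evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen features from a pre-trained ViT; ImageNet-1k has 1000
# classes, but a 10-class toy problem keeps this sketch fast to run.
num_classes, feature_dim, shots = 10, 768, 10
train_x = rng.normal(size=(num_classes * shots, feature_dim))
train_y = np.repeat(np.arange(num_classes), shots)
test_x = rng.normal(size=(500, feature_dim))
test_y = rng.integers(0, num_classes, size=500)

# Fit a linear probe on the few-shot examples and report the error rate.
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
error = 1.0 - probe.score(test_x, test_y)
print(f"{shots}-shot error: {error:.3f}")
```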