Scaling Exponents Across Parameterizations and Optimizers

Authors: Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters.
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²MIT, ³Work done at Google DeepMind. Correspondence to: Katie Everett <everettk@google.com>.
Pseudocode | No | The paper provides mathematical derivations and a specific code snippet, but no formally labeled pseudocode or algorithm block.
Open Source Code | No | The paper provides a code snippet for Adam-atan2 (a minimal sketch follows the table) and references the NanoDO codebase (http://github.com/google-deepmind/nanodo), to which some authors contributed, but it does not explicitly state that the full source code for the methodology described in *this* paper is released or available for download.
Open Datasets | Yes | All models are trained on the C4 dataset (Raffel et al., 2020), encoded with the T5 SentencePiece tokenizer (Kudo & Richardson, 2018).
Dataset Splits | No | The paper mentions 'evaluation inputs' but does not specify the dataset splits (e.g., percentages or counts for training, validation, and test sets) or the methodology for creating them.
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU or CPU models, or TPU versions) used to run the experiments, beyond mentioning distributed training techniques.
Software Dependencies | No | All experiments are implemented in Flax (Heek et al., 2023) on top of JAX (Bradbury et al., 2018) and use Optax optimizers (Babuschkin et al., 2020). Specific version numbers for these software components are not provided.
Experiment Setup | Yes | We use a fixed batch size of 256, context length of 512, and depth L = 8 for all experiments. Unless stated otherwise, we use typical default optimizer hyperparameters (for SGD, momentum m = 0.9; for Adam, ϵ = 10⁻⁹; and for Adafactor, ϵ = 10⁻³⁰). We do not use weight decay except in the weight decay experiments in Figure 9, and we do not use dropout. The learning rate schedule for all experiments uses linear warmup of 1,000 steps followed by a cosine decay schedule with initial and final learning rates of 0.0. (An Optax sketch of this setup follows the table.)
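Below is a minimal JAX sketch of the Adam-atan2 idea referenced in the Open Source Code row: the paper's snippet replaces Adam's epsilon-guarded division with atan2, removing the ϵ hyperparameter. The exact form and any scale constants in the paper's own snippet may differ; the function names here are illustrative.

```python
import jax.numpy as jnp

def adam_update(m, v, eps=1e-9):
    # Standard Adam update direction: requires choosing the eps hyperparameter
    # (the paper's quoted default is eps = 1e-9).
    return m / (jnp.sqrt(v) + eps)

def adam_atan2_update(m, v):
    # Epsilon-free alternative: arctan2(m, sqrt(v)) behaves like m / sqrt(v)
    # when sqrt(v) dominates, stays bounded, and is well-defined at v = 0,
    # so no eps needs to be tuned. Any scale constants from the paper's own
    # snippet are omitted in this sketch.
    return jnp.arctan2(m, jnp.sqrt(v))
```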
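The Experiment Setup row fully specifies the learning rate schedule and default optimizer hyperparameters, so a short Optax sketch can make the configuration concrete. The peak learning rate and total step count below are hypothetical placeholders (the paper sweeps more than a dozen learning rates rather than fixing one); everything else follows the quoted setup.

```python
import optax

# Hypothetical placeholders: the paper sweeps many peak learning rates, and
# the total step count is not quoted in the table above.
peak_lr = 1e-3
total_steps = 100_000

# Linear warmup of 1,000 steps followed by cosine decay, with initial and
# final learning rates of 0.0, as quoted in the Experiment Setup row.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=peak_lr,
    warmup_steps=1_000,
    decay_steps=total_steps,
    end_value=0.0,
)

# One of the three quoted optimizer defaults: Adam with eps = 1e-9.
# The others are SGD with momentum m = 0.9 (optax.sgd(schedule, momentum=0.9))
# and Adafactor with eps = 1e-30.
optimizer = optax.adam(learning_rate=schedule, eps=1e-9)
```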