Scaling Exponents Across Parameterizations and Optimizers

Authors: Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters.
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²MIT, ³Work done at Google DeepMind. Correspondence to: Katie Everett <everettk@google.com>.
Pseudocode | No | The paper provides mathematical derivations and a specific code snippet, but no formally labeled pseudocode or algorithm block.
Open Source Code | No | The paper provides a code snippet for Adam-atan2 (a minimal sketch follows the table) and references the NanoDO codebase (http://github.com/google-deepmind/nanodo), to which some authors contributed, but it does not explicitly state that the full source code for the methodology described in *this* paper is released or available for download.
Open Datasets | Yes | All models are trained on the C4 dataset (Raffel et al., 2020), encoded with the T5 SentencePiece tokenizer (Kudo & Richardson, 2018).
Dataset Splits | No | The paper mentions 'evaluation inputs' but does not specify the dataset splits (e.g., percentages or counts for training, validation, and test sets) or the methodology for creating them.
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU or CPU models, or TPU versions) used to run the experiments, beyond mentioning distributed training techniques.
Software Dependencies | No | All experiments are implemented in Flax (Heek et al., 2023) on top of JAX (Bradbury et al., 2018) and use Optax optimizers (Babuschkin et al., 2020). Specific version numbers for these software components are not provided.
Experiment Setup | Yes | We use a fixed batch size of 256, context length of 512, and depth L = 8 for all experiments. Unless stated otherwise, we use typical default optimizer hyperparameters (for SGD, momentum m = 0.9; for Adam, ϵ = 10⁻⁹; and for Adafactor, ϵ = 10⁻³⁰). We do not use weight decay except in the weight decay experiments in Figure 9, and we do not use dropout. The learning rate schedule for all experiments uses linear warmup of 1,000 steps followed by a cosine decay schedule with initial and final learning rates of 0.0. (An Optax sketch of this setup follows the table.)
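Below is a minimal JAX sketch of the Adam-atan2 idea referenced in the Open Source Code row: the paper's snippet replaces Adam's epsilon-guarded division with atan2, removing the ϵ hyperparameter. The exact form and any scale constants in the paper's own snippet may differ; the function names here are illustrative.

```python
import jax.numpy as jnp

def adam_update(m, v, eps=1e-9):
    # Standard Adam update direction: requires choosing the eps hyperparameter
    # (the paper's quoted default is eps = 1e-9).
    return m / (jnp.sqrt(v) + eps)

def adam_atan2_update(m, v):
    # Epsilon-free alternative: arctan2(m, sqrt(v)) behaves like m / sqrt(v)
    # when sqrt(v) dominates, stays bounded, and is well-defined at v = 0,
    # so no eps needs to be tuned. Any scale constants from the paper's own
    # snippet are omitted in this sketch.
    return jnp.arctan2(m, jnp.sqrt(v))
```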
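The Experiment Setup row fully specifies the learning rate schedule and default optimizer hyperparameters, so a short Optax sketch can make the configuration concrete. The peak learning rate and total step count below are hypothetical placeholders (the paper sweeps more than a dozen learning rates rather than fixing one); everything else follows the quoted setup.

```python
import optax

# Hypothetical placeholders: the paper sweeps many peak learning rates, and
# the total step count is not quoted in the table above.
peak_lr = 1e-3
total_steps = 100_000

# Linear warmup of 1,000 steps followed by cosine decay, with initial and
# final learning rates of 0.0, as quoted in the Experiment Setup row.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=peak_lr,
    warmup_steps=1_000,
    decay_steps=total_steps,
    end_value=0.0,
)

# One of the three quoted optimizer defaults: Adam with eps = 1e-9.
# The others are SGD with momentum m = 0.9 (optax.sgd(schedule, momentum=0.9))
# and Adafactor with eps = 1e-30.
optimizer = optax.adam(learning_rate=schedule, eps=1e-9)
```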