Scaling Exponents Across Parameterizations and Optimizers
Authors: Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind ²MIT ³Work done at Google DeepMind. Correspondence to: Katie Everett <everettk@google.com>. |
| Pseudocode | No | The paper provides mathematical derivations and a specific code snippet, but no formally labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper provides a code snippet for Adam-atan2 and references the NanoDO codebase (http://github.com/google-deepmind/nanodo), which some authors contributed to, but does not explicitly state that the full source code for the methodology described in *this* paper is released or available for download. (An illustrative sketch of the atan2 idea appears after the table.) |
| Open Datasets | Yes | All models are trained on the C4 dataset (Raffel et al., 2020), encoded with the T5 SentencePiece tokenizer (Kudo & Richardson, 2018). (A data-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions 'evaluation inputs' but does not specify the dataset splits (e.g., percentages or counts for training, validation, and test sets) or the methodology for creating these splits. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or TPU versions) used for running the experiments, beyond mentioning distributed training techniques. |
| Software Dependencies | No | All experiments are implemented in Flax (Heek et al., 2023) on top of JAX (Bradbury et al., 2018) and use Optax optimizers (Babuschkin et al., 2020). Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | We use a fixed batch size of 256, context length 512, and depth L = 8 for all experiments. Unless stated otherwise, we use typical default optimizer hyperparameters (for SGD, momentum m = 0.9; for Adam, ϵ = 10⁻⁹; and for Adafactor, ϵ = 10⁻³⁰). We do not use weight decay except in the weight decay experiments in Figure 9, and we do not use dropout. The learning rate schedule for all experiments uses linear warmup of 1,000 steps followed by a cosine decay schedule with initial and final learning rates of 0.0. (An Optax sketch of this configuration follows the table.) |
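
The Adam-atan2 snippet mentioned in the Open Source Code row is not reproduced in this report. As a minimal sketch of the underlying idea, replacing Adam's ε-stabilized division m̂/(√v̂ + ε) with a bounded atan2 form, the JAX fragment below illustrates the update direction; the function name `adam_atan2_update` and the scale factor `lam` are illustrative assumptions, not the paper's exact code.

```python
import jax.numpy as jnp

def adam_atan2_update(m_hat, v_hat, lam=1.0):
    """Illustrative Adam-style update direction using atan2.

    For |m_hat| much smaller than lam * sqrt(v_hat), the expression
    lam * atan2(m_hat, lam * sqrt(v_hat)) approximates m_hat / sqrt(v_hat),
    but it remains bounded and needs no epsilon hyperparameter.
    `lam` is an assumed scale knob, defaulting to 1.
    """
    return lam * jnp.arctan2(m_hat, lam * jnp.sqrt(v_hat))
```

Because atan2 is well defined even when its second argument is zero, the update stays finite when v̂ vanishes, which is the numerical-stability motivation for eliminating ε.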
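For the Open Datasets row, a rough sketch of how C4 and a T5-style SentencePiece vocabulary are commonly loaded in the JAX/TensorFlow ecosystem is shown below. The TFDS config name and the GCS vocabulary path are assumptions about the standard public artifacts; the paper does not specify its data-loading code.

```python
import tensorflow_datasets as tfds
import seqio

# C4 (English config) as distributed via TensorFlow Datasets -- assumed here;
# the paper only states that models are trained on C4.
train_ds = tfds.load("c4/en", split="train", shuffle_files=True)

# The standard public 32k-token T5 SentencePiece vocabulary; this GCS path is
# the commonly used artifact and is an assumption, not quoted from the paper.
vocab = seqio.SentencePieceVocabulary(
    "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"
)

token_ids = vocab.encode("Scaling exponents across parameterizations")
```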
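Finally, the training configuration quoted in the Experiment Setup row maps naturally onto Optax. The sketch below is a hedged illustration under stated assumptions: `peak_lr` and `total_steps` are placeholders, since the paper sweeps more than a dozen learning rates and fourteen model sizes.

```python
import optax

# Placeholders: the paper sweeps learning rates and model sizes, so these
# two values are illustrative, not the paper's settings.
peak_lr = 1e-3
total_steps = 50_000

# Linear warmup for 1,000 steps followed by cosine decay; initial and final
# learning rates are 0.0, matching the quoted setup.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=peak_lr,
    warmup_steps=1_000,
    decay_steps=total_steps,
    end_value=0.0,
)

# Default optimizer hyperparameters from the quoted setup: SGD momentum 0.9,
# Adam eps 1e-9, Adafactor eps 1e-30; no weight decay, no dropout.
sgd_opt = optax.sgd(learning_rate=schedule, momentum=0.9)
adam_opt = optax.adam(learning_rate=schedule, eps=1e-9)
adafactor_opt = optax.adafactor(learning_rate=schedule, eps=1e-30)
```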