Unified Scaling Laws for Routed Language Models

Authors: Aidan Clark, Diego De Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew J Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc'Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
Researcher Affiliation | Industry | 1DeepMind 2Google Research. Correspondence to: Aidan Clark <aidan.b.clark@gmail.com>, Diego de las Casas <diegolascasas@google.com>.
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | Yes | The data used to derive the scaling laws is available in a GitHub repository: https://github.com/deepmind/scaling_laws_for_routing
Open Datasets | No | We train on a multi-trillion-token compendium of English-language text comprising documents from the internet alongside open-source text datasets, details of which are given in Rae et al. (2021).
Dataset Splits | No | No explicit training/test/validation split percentages or sample counts were found.
Hardware Specification | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019).
Software Dependencies | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). Models were trained with a sequence length of 2048 and batch size of 256 for 250,000 steps, i.e. 130 billion tokens, regardless of N or E. This is an important detail, and we discuss some of the implications in App. F. All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details.
Experiment Setup | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). Models were trained with a sequence length of 2048 and batch size of 256 for 250,000 steps, i.e. 130 billion tokens, regardless of N or E. This is an important detail, and we discuss some of the implications in App. F. All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details.
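The quoted experiment setup fixes the token budget by simple arithmetic: 2048 tokens per sequence x 256 sequences per batch x 250,000 steps is roughly 131 billion tokens, consistent with the "130 billion tokens" stated above. A minimal Python sketch of that check follows; the variable names are illustrative, and only the numbers come from the quoted setup.

    # Token budget implied by the quoted training configuration.
    sequence_length = 2048     # tokens per sequence
    batch_size = 256           # sequences per optimizer step
    training_steps = 250_000   # total steps, independent of N or E

    total_tokens = sequence_length * batch_size * training_steps
    print(f"{total_tokens:,} tokens")  # 131,072,000,000, i.e. ~130 billion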