Unified Scaling Laws for Routed Language Models
Authors: Aidan Clark, Diego De Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew J Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters. |
| Researcher Affiliation | Industry | 1DeepMind 2Google Research. Correspondence to: Aidan Clark <aidan.b.clark@gmail.com>, Diego de las Casas <diegolascasas@google.com>. |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | Yes | The data used to derive the scaling laws is available in a GitHub repository: https://github.com/deepmind/scaling_laws_for_routing |
| Open Datasets | No | We train on a multitrillion-token compendium of English language text comprising documents from the internet alongside open-source text datasets, details of which are given in Rae et al. (2021). |
| Dataset Splits | No | No explicit training/test/validation split percentages or sample counts were found. |
| Hardware Specification | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). |
| Software Dependencies | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details. |
| Experiment Setup | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). Models were trained with a sequence length of 2048 and batch size of 256 for 250,000 steps, i.e. 130 billion tokens, regardless of N or E. This is an important detail, and we discuss some of the implications in App. F. All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details. |
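
The training-token budget quoted in the Experiment Setup row follows directly from the stated hyperparameters: a sequence length of 2048 tokens, a batch size of 256 sequences, and 250,000 steps give roughly 131 billion tokens, consistent with the paper's "130 billion tokens" figure. The short Python sketch below reproduces that arithmetic; the variable names are illustrative and not taken from the paper's code.

```python
# Sanity check on the training-token budget quoted above.
# Hyperparameter values are taken from the paper's setup description;
# variable names are our own.
sequence_length = 2048      # tokens per sequence
batch_size = 256            # sequences per optimizer step
training_steps = 250_000    # total optimizer steps

tokens_per_step = sequence_length * batch_size
total_tokens = tokens_per_step * training_steps

print(f"Tokens per step: {tokens_per_step:,}")      # 524,288
print(f"Total training tokens: {total_tokens:,}")   # 131,072,000,000 (~130B)
```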