Unified Scaling Laws for Routed Language Models
Authors: Aidan Clark, Diego De Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew J Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters. |
| Researcher Affiliation | Industry | 1DeepMind 2Google Research. Correspondence to: Aidan Clark <aidan.b.clark@gmail.com>, Diego de las Casas <diegolascasas@google.com>. |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | Yes | The data used to derive the scaling laws is available in a GitHub repository: https://github.com/deepmind/scaling_laws_for_routing |
| Open Datasets | No | We train on a multitrillion-token compendium of English language text comprising documents from the internet alongside open-source text datasets, details of which are given in Rae et al. (2021). |
| Dataset Splits | No | No explicit training/test/validation split percentages or sample counts were found. |
| Hardware Specification | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). |
| Software Dependencies | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details. |
| Experiment Setup | Yes | All models are trained on TPUs with JAX (Bradbury et al., 2018) using a combination of data, expert (see App. C) and sharding parallelism (Shoeybi et al., 2019). Models were trained with a sequence length of 2048 and batch size of 256 for 250,000 steps, i.e. 130 billion tokens, regardless of N or E. This is an important detail, and we discuss some of the implications in App. F. All were optimized with AdamW (Loshchilov & Hutter, 2018) and ZeRO Stage 1 was used to shard the optimizer state (Rajbhandari et al., 2020). App. A contains further details. |
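
The training-token budget quoted in the Experiment Setup row follows directly from the stated hyperparameters: a sequence length of 2048 tokens, a batch size of 256 sequences, and 250,000 steps give roughly 131 billion tokens, consistent with the paper's "130 billion tokens" figure. The short Python sketch below reproduces that arithmetic; the variable names are illustrative and not taken from the paper's code.

```python
# Sanity check on the training-token budget quoted above.
# Hyperparameter values are taken from the paper's setup description;
# variable names are our own.
sequence_length = 2048      # tokens per sequence
batch_size = 256            # sequences per optimizer step
training_steps = 250_000    # total optimizer steps

tokens_per_step = sequence_length * batch_size
total_tokens = tokens_per_step * training_steps

print(f"Tokens per step: {tokens_per_step:,}")      # 524,288
print(f"Total training tokens: {total_tokens:,}")   # 131,072,000,000 (~130B)
```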