Mechanistic Design and Scaling of Hybrid Architectures
Authors: Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M to 7B parameters. (A minimal scaling-law fitting sketch appears after this table.) |
| Researcher Affiliation | Collaboration | Together AI, Stanford University, Hessian AI, RIKEN, The University of Tokyo, Arc Institute, CZ Biohub, Liquid AI. |
| Pseudocode | No | The paper describes methods and processes through textual descriptions and schematics but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We test a suite of architectures in the MAD protocol... We connect the performance of architectures on MAD to their performance at scale on The Pile [9]. |
| Dataset Splits | No | The paper states that 'Model performances are always evaluated in an independent evaluation dataset, specific to each task setting' and discusses the optimal allocation of compute, but it does not explicitly provide train/validation/test splits (e.g., percentages or sample counts) for the main language-model training. |
| Hardware Specification | No | The paper acknowledges the 'forty-two' supercomputer of the 'Hessian.AISC Service Center' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions software components like the 'Adam optimizer' and 'RMSNorm' but does not provide specific version numbers for any of its software dependencies. (A minimal RMSNorm sketch appears after this table.) |
| Experiment Setup | Yes | For each MAD task, we train models according to the setting described in Table B.1, using a standard cross-entropy loss objective. Note that we sweep all evaluated architectures over a 3 × 2 grid of learning rate and weight decay values (see Table B.1). ... Table C.2: Common settings across all architectures. (A sketch of such a grid sweep appears after this table.) |
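
The Research Type row quotes the paper's compute-optimal and state-optimal scaling-law analysis over 500 models spanning 70M to 7B parameters. As a rough illustration of the fitting step such an analysis involves, the sketch below recovers a power law L(N) = a·N^(−α) + c from synthetic (model size, loss) pairs. The functional form is the standard scaling-law ansatz; every number in the snippet is a made-up placeholder, not a measurement from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    # Standard scaling-law ansatz: loss decays as a power of model size
    # toward an irreducible floor c.
    return a * n ** (-alpha) + c

# Synthetic (model size in billions of parameters, eval loss) pairs spanning
# the paper's 70M-7B range, generated from a known power law plus noise.
# Purely illustrative -- these are not the paper's measurements.
rng = np.random.default_rng(0)
sizes_b = np.array([0.07, 0.16, 0.41, 1.0, 2.8, 7.0])
losses = power_law(sizes_b, a=1.1, alpha=0.15, c=1.8) + rng.normal(0.0, 0.01, sizes_b.shape)

# Recover the exponent and loss floor from the noisy observations.
(a_hat, alpha_hat, c_hat), _ = curve_fit(power_law, sizes_b, losses, p0=[1.0, 0.1, 2.0])
print(f"alpha = {alpha_hat:.3f}, loss floor = {c_hat:.3f}")
```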
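
The Software Dependencies row names RMSNorm among the unversioned components. For reference, here is a minimal PyTorch sketch of the standard RMSNorm formulation (Zhang & Sennrich, 2019); this is the textbook definition, not code taken from the paper.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang & Sennrich, 2019)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS over the feature dimension;
        # unlike LayerNorm, there is no mean subtraction and no bias.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

# Usage: normalize a batch of 512-dimensional token embeddings.
y = RMSNorm(512)(torch.randn(2, 16, 512))
```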
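
The Experiment Setup row describes sweeping every architecture over a 3 × 2 grid of learning-rate and weight-decay values. The sketch below shows the shape of such a sweep; the grid values and the `train_and_eval` helper are hypothetical stand-ins, since the paper's actual settings live in its Table B.1.

```python
import itertools

def train_and_eval(lr: float, weight_decay: float) -> float:
    """Hypothetical stand-in: train one architecture on a MAD task with a
    cross-entropy loss and return its loss on the independent eval set.
    Here it scores the configuration with a dummy formula so the sweep
    is runnable end to end."""
    return abs(lr - 3e-4) * 1e3 + abs(weight_decay - 0.1)

# Illustrative 3 x 2 grid; the paper's actual values are in its Table B.1.
learning_rates = [1e-4, 3e-4, 1e-3]
weight_decays = [0.0, 0.1]

# Evaluate all six configurations and keep the one with the lowest eval loss.
results = {
    (lr, wd): train_and_eval(lr, wd)
    for lr, wd in itertools.product(learning_rates, weight_decays)
}
best_lr, best_wd = min(results, key=results.get)
print(f"best config: lr={best_lr:g}, weight_decay={best_wd:g}")
```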