Mechanistic Design and Scaling of Hybrid Architectures

Authors: Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M to 7B parameters. (A scaling-law fitting sketch follows the table.)
Researcher Affiliation | Collaboration | Together AI, Stanford University, Hessian AI, RIKEN, The University of Tokyo, Arc Institute, CZ Biohub, Liquid AI.
Pseudocode | No | The paper describes methods and processes through textual descriptions and schematics but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We test a suite of architectures in the MAD protocol... We connect the performance of architectures on MAD to their performance at scale on The Pile [9].
Dataset Splits | No | The paper states 'Model performances are always evaluated in an independent evaluation dataset, specific to each task setting.' and discusses optimal allocation of compute, but does not explicitly provide training/validation/test splits (e.g., percentages or sample counts) for the main language model training.
Hardware Specification | No | The paper mentions a 'supercomputer forty-two' from the 'Hessian.AISC Service Center' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer and RMSNorm but does not provide specific version numbers for any of its software dependencies. (A reference RMSNorm sketch follows the table.)
Experiment Setup | Yes | For each MAD task, we train models according to the setting described in Table B.1, using a standard cross-entropy loss objective. Note that we sweep all evaluated architectures over a 3 × 2 grid of learning rate and weight decay values (see Table B.1)... Table C.2: Common settings across all architectures. (A sweep sketch follows the table.)
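
To make the quoted scaling-law analysis concrete, here is a minimal sketch of fitting a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β to (parameter count, token count, loss) measurements. This is not the authors' code: the function name, parameter values, and synthetic data below are illustrative assumptions standing in for the paper's 500+ training runs.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric fit: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is training tokens. Illustrative only.
def parametric_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic stand-ins for measured (N, D, loss) triples; the paper trains
# models from 70M to 7B parameters, so this grid mimics that range.
rng = np.random.default_rng(0)
N = np.array([70e6, 160e6, 410e6, 1.4e9, 2.8e9, 7e9])
D = 20 * N  # hypothetical token budgets at a roughly compute-optimal ratio
true_params = (1.69, 406.0, 0.34, 411.0, 0.28)
loss = parametric_loss((N, D), *true_params) + rng.normal(0.0, 0.01, N.size)

# Recover the irreducible loss E and the scaling exponents by least squares.
popt, _ = curve_fit(parametric_loss, (N, D), loss,
                    p0=[2.0, 300.0, 0.3, 300.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f} A={A:.0f} alpha={alpha:.2f} B={B:.0f} beta={beta:.2f}")
```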
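Since the Software Dependencies row flags RMSNorm without a pinned version, a reference implementation may help with reproduction. The following is the standard RMSNorm formulation (Zhang & Sennrich, 2019) in PyTorch; it is a sketch of the commonly used definition, not the paper's actual module.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (Zhang & Sennrich, 2019).

    Normalizes by the RMS of the last dimension (no mean subtraction,
    no bias), then applies a learned per-feature gain.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean square gives 1 / RMS(x) along the feature dim.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```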
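Finally, the Experiment Setup row quotes a 3 × 2 grid sweep over learning rate and weight decay with a cross-entropy objective. The sketch below shows the shape of such a sweep; the grid values, model, data, and training loop are hypothetical placeholders, since the concrete settings live in the paper's Tables B.1 and C.2.

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical sweep values; the actual 3 x 2 grid is in the paper's Table B.1.
learning_rates = [1e-4, 3e-4, 1e-3]   # 3 learning rates
weight_decays = [0.0, 0.1]            # 2 weight decays

def train_and_eval(lr: float, wd: float) -> float:
    """Placeholder: train one model with AdamW + cross-entropy, return loss."""
    model = nn.Linear(128, 256)        # stand-in for a MAD architecture
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(32, 128)           # stand-in batch of inputs
    y = torch.randint(0, 256, (32,))   # stand-in next-token targets
    for _ in range(10):                # stand-in training loop
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Best-of-grid selection over the 3 x 2 sweep.
results = {(lr, wd): train_and_eval(lr, wd)
           for lr, wd in itertools.product(learning_rates, weight_decays)}
best = min(results, key=results.get)
print("best (lr, wd):", best, "loss:", results[best])
```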