Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
STAR: Synthesis of Tailored Architectures
Authors: Armin Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, Michael Poli
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling. ... We evaluate STAR on autoregressive language modeling... |
| Researcher Affiliation | Collaboration | 1Liquid AI 2The University of Tokyo 3RIKEN 4Stanford University |
| Pseudocode | No | The paper describes the key steps of STAR evolution (Assessment, Pairing, Recombination, Mutation) in text and Figure 4.1 visually summarizes them, but it does not include a formal, structured pseudocode or algorithm block. |
| Open Source Code | No | The reproducibility statement mentions running optimization and training on open-source datasets and reporting details about the evolutionary algorithms, but it does not explicitly state that the authors' own source code for the methodology will be released or provide a link to a repository. |
| Open Datasets | Yes | Experiments are performed in autoregressive language modeling on 4096 token sequences from the Red Pajama dataset (Weber et al., 2024). |
| Dataset Splits | No | The paper specifies training token counts (e.g., '1.3B tokens', '5B tokens', '40B tokens') and evaluation token counts ('500M-token evaluation set') from the Red Pajama dataset, but it does not provide specific train/validation/test dataset splits (e.g., percentages, exact sample counts for each split) or refer to standard predefined splits for the dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions using 'AdamW (Loshchilov et al., 2017)' as an optimizer, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | During STAR evolution, models are trained from scratch for 1.3B tokens using AdamW (Loshchilov et al., 2017) with a peak learning rate of 0.0008, a batch size of 0.25M tokens, and a cosine learning rate schedule with a 130M-token linear warmup. The resulting synthesized backbones are evaluated by training them from scratch for 5B tokens under the same setup but with an extended warmup of 400M tokens. Additionally, we train select 1B-parameter models (48 LIVs at a width of 2048) for 40B tokens, increasing the batch size to 0.75M tokens and the warmup to 2.6B tokens. Appendix A.2 further details training settings including optimizer momentum, weight decay, dropout, and gradient clipping in Tables A.1, A.2, and A.3. |
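The learning-rate schedule quoted in the Experiment Setup row (peak 8e-4, cosine decay, 130M-token linear warmup over a 1.3B-token run) can be sketched as a token-parameterized function. This is a minimal illustration of that schedule shape only; the function and parameter names are ours, not from the paper's code, and the paper does not state a minimum learning rate.

```python
import math

def lr_at(tokens_seen, peak_lr=8e-4, warmup_tokens=130e6,
          total_tokens=1.3e9, min_lr=0.0):
    """Cosine decay with linear warmup, parameterized in tokens
    (illustrative; min_lr=0 is an assumption)."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * tokens_seen / warmup_tokens
    # Cosine decay from peak_lr down to min_lr over the remaining tokens.
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the 5B-token evaluation runs described in the same row, only `warmup_tokens` (400M) and `total_tokens` would change under this sketch.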
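The Pseudocode row notes that the paper describes its four evolution steps (Assessment, Pairing, Recombination, Mutation) only in text. As a reading aid, a generic evolutionary loop over those four named steps might look like the following. This is not the authors' algorithm: the selection scheme (top-half truncation), population handling, and all function names here are our assumptions.

```python
import random

def evolution_sketch(population, generations, fitness_fn, recombine, mutate):
    """Generic loop over the four steps named in the paper; every
    design choice below (truncation selection, full generational
    replacement) is an illustrative assumption."""
    for _ in range(generations):
        # Assessment: score every candidate architecture genome.
        ranked = sorted(population, key=fitness_fn, reverse=True)
        # Pairing: restrict mating to the top half of the population.
        parents = ranked[: max(2, len(ranked) // 2)]
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            # Recombination: combine two parent genomes into a child.
            child = recombine(a, b)
            # Mutation: randomly perturb the child genome.
            children.append(mutate(child))
        population = children
    return max(population, key=fitness_fn)
```

With toy bit-string genomes, a sum fitness, single-point crossover, and a bit-flip mutation, the loop steadily concentrates the population on high-fitness genomes.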