Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
STAR: Synthesis of Tailored Architectures
Authors: Armin Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, Michael Poli
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling. ... We evaluate STAR on autoregressive language modeling... |
| Researcher Affiliation | Collaboration | 1Liquid AI 2The University of Tokyo 3RIKEN 4Stanford University |
| Pseudocode | No | The paper describes the key steps of STAR evolution (Assessment, Pairing, Recombination, Mutation) in text and Figure 4.1 visually summarizes them, but it does not include a formal, structured pseudocode or algorithm block. |
| Open Source Code | No | The reproducibility statement mentions running optimization and training on open-source datasets and reporting details about the evolutionary algorithms, but it does not explicitly state that the authors' own source code for the methodology will be released or provide a link to a repository. |
| Open Datasets | Yes | Experiments are performed in autoregressive language modeling on 4096 token sequences from the Red Pajama dataset (Weber et al., 2024). |
| Dataset Splits | No | The paper specifies training token counts (e.g., '1.3B tokens', '5B tokens', '40B tokens') and evaluation token counts ('500M-token evaluation set') from the Red Pajama dataset, but it does not provide specific train/validation/test dataset splits (e.g., percentages, exact sample counts for each split) or refer to standard predefined splits for the dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions using 'AdamW (Loshchilov et al., 2017)' as an optimizer, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | During STAR evolution, models are trained from scratch for 1.3B tokens using AdamW (Loshchilov et al., 2017) with a peak learning rate of 0.0008, a batch size of 0.25M tokens, and a cosine learning rate schedule with a 130M-token linear warmup. The resulting synthesized backbones are evaluated by training them from scratch for 5B tokens under the same setup but with an extended warmup of 400M tokens. Additionally, we train select 1B-parameter models (48 LIVs at a width of 2048) for 40B tokens, increasing the batch size to 0.75M tokens and the warmup to 2.6B tokens. Appendix A.2 further details training settings including optimizer momentum, weight decay, dropout, and gradient clipping in Tables A.1, A.2, and A.3. |
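The learning-rate schedule quoted in the Experiment Setup row (peak 8e-4, cosine decay, 130M-token linear warmup over a 1.3B-token run) can be sketched as a token-parameterized function. This is a minimal illustration of that schedule shape only; the function and parameter names are ours, not from the paper's code, and the paper does not state a minimum learning rate.

```python
import math

def lr_at(tokens_seen, peak_lr=8e-4, warmup_tokens=130e6,
          total_tokens=1.3e9, min_lr=0.0):
    """Cosine decay with linear warmup, parameterized in tokens
    (illustrative; min_lr=0 is an assumption)."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * tokens_seen / warmup_tokens
    # Cosine decay from peak_lr down to min_lr over the remaining tokens.
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the 5B-token evaluation runs described in the same row, only `warmup_tokens` (400M) and `total_tokens` would change under this sketch.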
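The Pseudocode row notes that the paper describes its four evolution steps (Assessment, Pairing, Recombination, Mutation) only in text. As a reading aid, a generic evolutionary loop over those four named steps might look like the following. This is not the authors' algorithm: the selection scheme (top-half truncation), population handling, and all function names here are our assumptions.

```python
import random

def evolution_sketch(population, generations, fitness_fn, recombine, mutate):
    """Generic loop over the four steps named in the paper; every
    design choice below (truncation selection, full generational
    replacement) is an illustrative assumption."""
    for _ in range(generations):
        # Assessment: score every candidate architecture genome.
        ranked = sorted(population, key=fitness_fn, reverse=True)
        # Pairing: restrict mating to the top half of the population.
        parents = ranked[: max(2, len(ranked) // 2)]
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            # Recombination: combine two parent genomes into a child.
            child = recombine(a, b)
            # Mutation: randomly perturb the child genome.
            children.append(mutate(child))
        population = children
    return max(population, key=fitness_fn)
```

With toy bit-string genomes, a sum fitness, single-point crossover, and a bit-flip mutation, the loop steadily concentrates the population on high-fitness genomes.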