Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Best Practices for Scientific Research on Neural Architecture Search
Authors: Marius Lindauer, Frank Hutter
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Although the community has made major strides in developing better NAS methods, the quality of scientific empirical evaluations in the young field of NAS is still lagging behind that of other areas of machine learning. To address this issue, we describe a set of possible issues and ways to avoid them, leading to the NAS best practices checklist available at http://automl.org/nas_checklist.pdf. ... We therefore propose best practices for empirical evaluations of NAS methods, which we believe will facilitate sustained and measurable progress in the field. |
| Researcher Affiliation | Collaboration | Marius Lindauer EMAIL Leibniz University of Hannover Hannover, 30167, Germany. Frank Hutter EMAIL University of Freiburg & Bosch Center for Artificial Intelligence Freiburg im Breisgau, 79110, Germany. |
| Pseudocode | No | The paper describes best practices and concepts in text and definitions but does not present any pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | No | The paper strongly advocates for releasing code in Section 2, but it does not state that the code for the best practices themselves or any new methodology introduced by the authors of this paper is made available. It refers to a checklist and other third-party resources. For example: "We encourage anyone who can do so to simply put a copy of the code online as it was used, appropriately labelled as prototype research code, without using extra time to clean it up." and "We note that on standard NAS benchmarks, for most researchers, due to limited computational resources it will be impossible to satisfy the best practices in this section (especially ablation studies and performing several repeated runs). Especially in such cases, we advocate running extensive evaluations on tabular NAS benchmarks, or on surrogate benchmarks as proposed by Siems et al. (2020) for NAS following the work of Eggensperger et al. (2015; 2018). We list available tabular and surrogate NAS benchmarks in Table 1." |
| Open Datasets | Yes | The seminal paper by Zoph and Le (2017) used the CIFAR-10 and PTB datasets for its empirical evaluation, and more than 300 NAS papers later, these datasets still dominate in empirical evaluations. |
| Dataset Splits | Yes | Definition 3 (NAS Benchmark) A NAS benchmark consists of a dataset (with a predefined training-test split4), a search space5, and available runnable code with pre-defined hyperparameters for training the architectures. ... Example 4 A prominent NAS benchmark is the publicly available search space and training pipeline of DARTS (Liu et al., 2019b), evaluated on CIFAR-10 (with standard training/test split). |
| Hardware Specification | No | Best Practice 14: Report All the Details of Your Experimental Setup ... Overall, we recommend to report all the details required to reproduce results; all top machine learning conferences allow for a long appendix, such that space is never a reason to omit these details. ... If method A needed twice as much time as method B, but method A was evaluated on an old GPU and method B on a recent one, the difference in GPU may explain the entire difference in speed. The paper emphasizes the importance of reporting hardware, but does not provide details of hardware used for its own work, as it is a best practices paper rather than an experimental one. |
| Software Dependencies | No | Best Practice 14: Report All the Details of Your Experimental Setup ... Deep Learning libraries, such as tensorflow, pytorch and co are getting more efficient over time, but which version was actually used is unfortunately only reported rarely. The paper mentions various deep learning libraries in the context of best practices, but does not specify particular versions of any software used for its own work. |
| Experiment Setup | No | Best Practice 12: Report the Use of Hyperparameter Optimization ... It is well known that these hyperparameters can influence results substantially; e.g., for DARTS (Liu et al., 2019b), they can make the difference between state-of-the-art performance and converging to degenerate architectures with very poor performance (Zela et al., 2020a). The paper discusses the importance of reporting experimental setup details like hyperparameters but does not provide any specific values or configurations for its own work, as it is a best practices paper rather than an experimental one. |
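The notice above states that the LLM-based classifications behind this table were validated against a manually labeled dataset. As a rough illustration of what such a validation entails, the sketch below computes per-variable agreement between hypothetical LLM labels and manual labels; the function name and all data are invented for illustration and are not the actual pipeline or metrics from [1].

```python
# Hypothetical validation sketch: compare LLM-assigned reproducibility
# labels against manual ground-truth labels and report raw agreement.
# All names and label values here are illustrative assumptions.

def label_accuracy(llm_labels, manual_labels):
    """Fraction of items where the LLM label matches the manual label."""
    assert len(llm_labels) == len(manual_labels), "label lists must align"
    matches = sum(a == b for a, b in zip(llm_labels, manual_labels))
    return matches / len(llm_labels)

# Illustrative labels for one variable (e.g. "Open Source Code") on five papers.
llm = ["Yes", "No", "No", "Yes", "No"]
manual = ["Yes", "No", "Yes", "Yes", "No"]
print(label_accuracy(llm, manual))  # → 0.8
```

Real validation, as the notice implies, would report such accuracy figures per reproducibility variable so readers can calibrate how much to trust each row; the full metrics are in [1].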