Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Neural Architecture Search: A Survey
Authors: Thomas Elsken, Jan Hendrik Metzen, Frank Hutter
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an overview of existing work in this field of research and categorize them according to three dimensions: search space, search strategy, and performance estimation strategy. ... Already by now, NAS methods have outperformed manually designed architectures on some tasks such as image classification (Zoph et al., 2018; Real et al., 2019), object detection (Zoph et al., 2018) or semantic segmentation (Chen et al., 2018). ... Real et al. (2019) conduct a case study comparing RL, evolution, and random search (RS), concluding that RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models. Both approaches consistently perform better than RS in their experiments, but with a rather small margin: RS achieved test errors of approximately 4% on CIFAR-10, while RL and evolution reached approximately 3.5% (after model augmentation where depth and number of filters was increased; the difference on the non-augmented space actually used for the search was approx. 2%). |
| Researcher Affiliation | Collaboration | Thomas Elsken EMAIL Bosch Center for Artificial Intelligence 71272 Renningen, Germany and University of Freiburg. Jan Hendrik Metzen EMAIL Bosch Center for Artificial Intelligence 71272 Renningen, Germany. Frank Hutter EMAIL University of Freiburg 79110 Freiburg, Germany. |
| Pseudocode | No | The paper provides descriptive text and figures to illustrate concepts, but does not contain any clearly labeled pseudocode or algorithm blocks for its own methodology or for the methods it surveys. |
| Open Source Code | No | The paper is a survey and describes various methodologies, but does not provide any concrete access to source code for the methodology described in this paper by its authors. It mentions code in the context of other works (e.g., 'open-source AutoML system'), but not for its own contribution. |
| Open Datasets | Yes | Already by now, NAS methods have outperformed manually designed architectures on some tasks such as image classification (Zoph et al., 2018; Real et al., 2019), object detection (Zoph et al., 2018) or semantic segmentation (Chen et al., 2018). ... Real et al. (2019) conduct a case study comparing RL, evolution, and random search (RS), concluding that RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models. Both approaches consistently perform better than RS in their experiments, but with a rather small margin: RS achieved test errors of approximately 4% on CIFAR-10, while RL and evolution reached approximately 3.5%... The difference was even smaller for Liu et al. (2018b), who reported a test error of 3.9% on CIFAR-10 and a top-1 validation error of 21.0% on ImageNet for RS, compared to 3.75% and 20.3% for their evolution-based method, respectively. ... While most authors report results on the CIFAR-10 data set... ... for optimizing recurrent neural networks (Greff et al., 2015; Jozefowicz et al., 2015; Zoph and Le, 2017; Rawal and Miikkulainen, 2018), e.g., for language or music modeling. |
| Dataset Splits | No | The paper discusses various datasets used in the reviewed literature (e.g., CIFAR-10, ImageNet, Penn Treebank) and mentions that 'measurements of an architecture’s performance depend on many factors other than the architecture itself. While most authors report results on the CIFAR-10 data set, experiments often differ with regard to search space, computational budget, data augmentation, training procedures, regularization, and other factors.' However, it does not explicitly provide specific dataset split information for any experiments conducted by the authors of this survey paper, nor does it detail standard splits for the mentioned public datasets. |
| Hardware Specification | No | The paper is a survey and describes the hardware usage of other works (e.g., '800 GPUs for three to four weeks', 'computational demands in the order of thousands of GPU days for NAS'), but it does not specify any hardware details for running its own experiments. |
| Software Dependencies | No | The paper discusses various algorithms and approaches (e.g., 'REINFORCE policy gradient algorithm', 'Proximal Policy Optimization', 'Q-learning', 'Gaussian processes'), but it does not provide specific software names with version numbers for any ancillary software dependencies. |
| Experiment Setup | No | The paper is a survey and discusses experimental setups of other research (e.g., 'a cosine annealing learning rate schedule', 'data augmentation by Cutout', 'training for fewer epochs'), but it does not provide specific experimental setup details, hyperparameters, or training configurations for its own work. |
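The table above repeatedly contrasts random search (RS) with RL- and evolution-based NAS strategies. As a rough illustration of what the RS baseline amounts to, the sketch below samples architectures uniformly from a tiny hypothetical search space and keeps the best one under a mock evaluation function. The search space, the `mock_evaluate` scoring, and all names are illustrative assumptions, not the setup of any work surveyed in the paper; a real NAS run would train and validate each sampled network.

```python
import random

# Hypothetical toy search space: an "architecture" is a depth/width/op choice.
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [16, 32, 64],
    "op": ["conv3x3", "conv5x5", "maxpool"],
}

def sample_architecture(rng):
    """Draw one architecture uniformly at random from the search space."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}

def mock_evaluate(arch):
    """Stand-in for training + validation accuracy (purely illustrative)."""
    score = arch["depth"] * 0.5 + arch["width"] * 0.01
    if arch["op"] == "conv3x3":
        score += 1.0
    return score

def random_search(n_trials=20, seed=0):
    """Sample n_trials architectures and return the best one with its score."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = mock_evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```

Despite its simplicity, this kind of baseline is what the quoted case studies found hard to beat by a large margin, which is why the survey stresses comparing NAS methods against it under matched budgets.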