Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

BOSS: Bayesian Optimization over String Spaces

Authors: Henry Moss, David Leslie, Daniel Beck, Javier González, Paul Rayson

NeurIPS 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We now evaluate our proposed BO framework on tasks from a range of fields and syntactical constraints. Our code is available at github.com/henrymoss/BOSS and is built upon the Emukit Python package [Paleyes et al., 2019]. All results are based on runs across 15 random seeds, showing the mean and a single standard error of the best objective value found as we increase the optimization budget.
Researcher Affiliation	Collaboration	Henry B. Moss, STOR-i Centre for Doctoral Training, Lancaster University, UK; Daniel Beck, Computing and Information Systems, University of Melbourne, Australia; Javier González, Microsoft Research, Cambridge, UK; David S. Leslie, Dept. of Mathematics and Statistics, Lancaster University, UK; Paul Rayson, School of Computing and Communications, Lancaster University, UK
Pseudocode	No	The paper describes algorithms in text and through figures but does not include structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at github.com/henrymoss/BOSS and is built upon the Emukit Python package [Paleyes et al., 2019].
Open Datasets	Yes	We replicate the symbolic regression example of Kusner et al. [2017], using their provided VAEs pre-trained for this exact problem. ...large collection of 250,000 candidate molecules used by Kusner et al. [2017]...
Dataset Splits	No	The paper discusses training and testing for different models but does not provide explicit details on train/validation/test dataset splits (percentages or counts) for its own experiments.
Hardware Specification	Yes	Although acquisition function calculations could be parallelized across the populations of our GA at each BO step, we use a single-core Intel Xeon 2.30GHz processor to paint a clear picture of computational cost.
Software Dependencies	No	The paper mentions building upon the 'Emukit Python package' but does not provide specific version numbers for Emukit or Python, which are necessary for full reproducibility of software dependencies.
Experiment Setup	Yes	All results are based on runs across 15 random seeds, showing the mean and a single standard error of the best objective value found as we increase the optimization budget. ... After a random initialization of min(5, |Σ|) evaluations, kernel parameters are re-estimated to maximize model likelihood before each BO step. ... Our genetic algorithms (ga) limited to 100 evolutions of a population of size 100.
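
The reporting convention quoted in the table (mean and a single standard error of the best objective value found, across 15 random seeds, as the optimization budget grows) can be illustrated with a minimal sketch. This is not the authors' code: the function name `best_so_far_summary`, the `maximize` flag, and the synthetic data are all hypothetical, and the sketch assumes each run is stored as one row of per-step objective values.

```python
import numpy as np


def best_so_far_summary(objective_values, maximize=True):
    """Summarize optimization runs across seeds (hypothetical helper).

    objective_values: array-like of shape (n_seeds, n_steps), where entry
    [i, t] is the objective value observed by seed i at evaluation t.
    Returns the per-step mean and standard error of the best value so far.
    """
    values = np.asarray(objective_values, dtype=float)
    if values.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_seeds, n_steps)")

    # Running best per seed: cumulative max (or min when minimizing).
    accumulate = np.maximum.accumulate if maximize else np.minimum.accumulate
    running_best = accumulate(values, axis=1)

    n_seeds = running_best.shape[0]
    mean = running_best.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    sem = running_best.std(axis=0, ddof=1) / np.sqrt(n_seeds)
    return mean, sem


# Synthetic stand-in for 15 seeds and a budget of 30 evaluations (assumed sizes).
rng = np.random.default_rng(0)
fake_runs = rng.normal(size=(15, 30))
mean_curve, sem_curve = best_so_far_summary(fake_runs)
```

Plotting `mean_curve` with a band of plus or minus `sem_curve` gives a curve of the kind the quoted setup describes. Whether a given task maximizes or minimizes its objective is an assumption to set per experiment.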