Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bayesian Optimization for Policy Search via Online-Offline Experimentation

Authors: Benjamin Letham, Eytan Bakshy

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure empirical learning curves which show substantial gains from including data from biased offline experiments, and show how these learning curves are consistent with theoretical results for multi-task Gaussian process generalization. We find that improved kernel inference is a significant driver of multi-task generalization. Finally, we show several examples of Bayesian optimization efficiently tuning a live machine learning system by combining offline and online experiments.
Researcher Affiliation | Industry | Benjamin Letham (EMAIL), Eytan Bakshy (EMAIL), Facebook, Menlo Park, California, USA
Pseudocode | Yes | Algorithm 1: Online-Offline Bayesian Optimization
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper mentions using data from the "Facebook News Feed" and "live Facebook traffic" as well as a simulator. These are internal to Facebook, and no specific links, DOIs, repository names, or formal citations are provided for public access to these datasets.
Dataset Splits | Yes | Each optimization began with a batch of 20 online observations and 100 simulator observations. The policies in these batches were generated from separate scrambled Sobol sequences (Owen, 1998) using a different seed. For each experiment, nine outcomes were measured which were treated independently for a total of 99 outcomes to be modeled. ... We evaluated prediction performance using leave-one-out cross-validation, in which each online observation was removed from the data and the model fit to the remaining 19 observations (plus 100 simulator observations for the MTGP) was used to predict the online outcome of the held-out policy. ... To estimate the MTGP learning curve at n_T online observations and n_S simulator observations, we randomly sampled (without replacement) n_T of the 20 online observations and n_S of the 100 simulator observations and fit a model to only those data. We then made predictions at the held-out 20 − n_T online points and evaluated mean squared prediction error and mean predictive variance. This was repeated with 500 samples of the data to approximate the expectation over X_T, X_S, and x in (7).
Hardware Specification | No | The paper discusses tuning an "online machine learning system" and running experiments on "Internet services" at Facebook. However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these systems or the experiments described.
Software Dependencies | No | The paper mentions "using the Scipy interface to L-BFGS (Byrd et al., 1995; Zhu et al., 1997)". While Scipy is a software library, a specific version number is not provided, nor are other key software components with versions.
Experiment Setup | Yes | Each optimization began with a batch of 20 online observations and 100 simulator observations. The policies in these batches were generated from separate scrambled Sobol sequences (Owen, 1998) using a different seed. ... In our experiments here, the initial quasi-random batches consisted of n_T = 20 online tests and n_S = 100 simulator tests. ... The full Bayesian optimization loop used here is given in Algorithm 1. ... In Line 5, we used the acquisition function to generate a batch of n_o = 30 optimized policies. In Line 8, we used Thompson sampling (Thompson, 1933) to select 8–10 policies for online tests, depending on available capacity. ... The constraint outcomes here are other metrics that have trade-offs with the objective, so improving one can often be to the detriment of the other. The experiments targeted different sets of value model parameters, with the design space dimensionality ranging from 10 to 20.
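The subsampled learning-curve protocol quoted in the Dataset Splits row above can be sketched as follows. This is a minimal illustration, not the paper's method: the paper's data are internal to Facebook and its model is a multi-task GP, so synthetic data and a deliberately trivial 1-nearest-neighbour predictor stand in for both. Only the sample-without-replacement evaluation loop mirrors the described procedure; all function and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 20 "online" and 100 "simulator" observations
# of a 1-D synthetic outcome. The simulator copy carries a constant bias,
# loosely mimicking a biased offline experiment.
X_online = rng.uniform(0.0, 1.0, size=(20, 1))
y_online = np.sin(6.0 * X_online[:, 0]) + 0.1 * rng.standard_normal(20)
X_sim = rng.uniform(0.0, 1.0, size=(100, 1))
y_sim = np.sin(6.0 * X_sim[:, 0]) + 0.3

def fit_predict(X_train, y_train, X_test):
    """Stand-in for the paper's (MT)GP: 1-nearest-neighbour prediction.
    Used only so the evaluation loop below is runnable."""
    d = np.abs(X_test[:, None, 0] - X_train[None, :, 0])
    return y_train[np.argmin(d, axis=1)]

def learning_curve_point(n_T, n_S, n_reps=500):
    """Estimate mean squared prediction error at (n_T online, n_S simulator)
    observations by repeatedly subsampling without replacement, predicting
    at the held-out 20 - n_T online points, and averaging over repetitions."""
    errs = []
    for _ in range(n_reps):
        idx_T = rng.choice(20, size=n_T, replace=False)
        idx_S = rng.choice(100, size=n_S, replace=False)
        held = np.setdiff1d(np.arange(20), idx_T)
        X_tr = np.vstack([X_online[idx_T], X_sim[idx_S]])
        y_tr = np.concatenate([y_online[idx_T], y_sim[idx_S]])
        pred = fit_predict(X_tr, y_tr, X_online[held])
        errs.append(np.mean((pred - y_online[held]) ** 2))
    return float(np.mean(errs))

mse = learning_curve_point(n_T=5, n_S=50, n_reps=50)
print(mse)
```

Sweeping `n_T` and `n_S` over a grid of values and plotting `learning_curve_point` for each pair would reproduce the shape of the evaluation, with the paper's 500-sample setting substituted for the smaller `n_reps` used here.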
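The initialization and selection steps described in the Experiment Setup row can be sketched as below, assuming SciPy's `scipy.stats.qmc.Sobol` for the scrambled Sobol sequences. The posterior here is a toy stand-in (the paper fits a multi-task GP to the combined online and simulator data), and the vote-based batch selection is one plausible reading of "used Thompson sampling to select 8–10 policies"; all names are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(1)
dim = 10  # paper: design space dimensionality ranged from 10 to 20

# Initial quasi-random batches from separate scrambled Sobol sequences
# with different seeds: n_T = 20 online tests, n_S = 100 simulator tests.
online_init = qmc.Sobol(d=dim, scramble=True, seed=1).random(20)
sim_init = qmc.Sobol(d=dim, scramble=True, seed=2).random(100)

def posterior_samples(candidates, n_draws):
    """Hypothetical stand-in posterior: independent normal draws around a
    synthetic quadratic mean. In the paper this role is played by the
    multi-task GP posterior over the online task."""
    mean = -np.sum((candidates - 0.5) ** 2, axis=1)
    return mean[None, :] + 0.1 * rng.standard_normal((n_draws, len(candidates)))

# Acquisition step (Algorithm 1, Line 5): a batch of n_o = 30 candidate
# policies. Fresh Sobol points stand in for acquisition-optimized ones.
candidates = qmc.Sobol(d=dim, scramble=True, seed=3).random(30)

# Thompson-sampling step (Line 8): each posterior draw votes for its argmax
# candidate; the most-voted policies go to online tests. The batch size of 8
# sits at the low end of the paper's 8-10, "depending on available capacity".
draws = posterior_samples(candidates, n_draws=200)
votes = np.bincount(np.argmax(draws, axis=1), minlength=len(candidates))
selected = np.argsort(votes)[::-1][:8]
print(selected)
```

In a full loop, the outcomes of the selected online tests would be fed back into the model and the candidate-generation and selection steps repeated, as in the paper's Algorithm 1.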