Safe Policy Improvement with Baseline Bootstrapping
Authors: Romain Laroche, Paul Trichelair, Remi Tachet des Combes
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, in Section 3, we motivate our approach on a small stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB compared to existing algorithms, not only in safety but also in mean performance. Furthermore, we apply the model-free version to a continuous navigation task. |
| Researcher Affiliation | Industry | Romain Laroche 1, Paul Trichelair 1, Remi Tachet des Combes 1; 1 Microsoft Research, Montréal, Canada. |
| Pseudocode | Yes | Algorithm 1 Greedy projection of Q^(i) on Πb (a hedged sketch of this projection step is given below the table) |
| Open Source Code | Yes | The code may be found at https://github.com/RomainLaroche/SPIBB and https://github.com/rems75/SPIBB-DQN. |
| Open Datasets | No | The paper describes custom experimental setups like a 'stochastic gridworld domain', 'randomly generated MDPs', and a 'helicopter navigation task' but does not provide access information (links, DOIs, formal citations) for publicly available datasets. |
| Dataset Splits | No | The paper describes a batch reinforcement learning setup where policies are trained on a fixed dataset and evaluated, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow and Keras but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We start by analysing the sensitivity of Πb-SPIBB and Π≤b-SPIBB with respect to N∧. We visually represent the results as two 1%-CVaR heatmaps: Figures 1(b) and 1(c) for Πb-SPIBB and Π≤b-SPIBB. ... In addition to the SPIBB algorithms, our finite MDP benchmark contains four algorithms: Basic RL, HCPI (Thomas et al., 2015a), Robust MDP, and RaMDP (Petrik et al., 2016). RaMDP stands for Reward-adjusted MDP and applies an exploration penalty when performing actions rarely observed in the dataset. At the exception of Basic RL, they all rely on one hyper-parameter: δhcpi, δrob and κadj respectively. We performed a grid search on those parameters and for HCPI compared 3 versions. In the main text, we only report the best performance we found (δhcpi = 0.9, δrob = 0.1, and κadj = 0.003); the full results can be found in Appendix B.2. ... To train our algorithms, we use a discount factor γ = 0.9, but we report in our results the undiscounted final reward. (A sketch of the RaMDP reward penalty mentioned here is given below the table.) |
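
The Pseudocode row above quotes Algorithm 1, the greedy projection of Q^(i) onto Πb. As a reading aid, here is a minimal Python sketch of that projection step, assuming tabular arrays for the value estimates, the baseline policy, and the dataset counts. The function and argument names (including the `n_wedge` threshold) are illustrative and are not the authors' code; their reference implementation lives in the repositories linked above.

```python
import numpy as np

def spibb_greedy_projection(Q, pi_b, counts, n_wedge):
    """Sketch of the greedy projection of Q onto the set Pi_b (Algorithm 1).

    Q       : (n_states, n_actions) current state-action value estimates Q^(i)
    pi_b    : (n_states, n_actions) baseline policy probabilities
    counts  : (n_states, n_actions) state-action counts N_D(s, a) in the batch
    n_wedge : count threshold below which a pair is considered "bootstrapped"
    """
    pi = np.zeros_like(pi_b)
    n_states = Q.shape[0]
    for s in range(n_states):
        bootstrapped = counts[s] < n_wedge       # rarely observed pairs
        # Keep the baseline probabilities on bootstrapped actions.
        pi[s, bootstrapped] = pi_b[s, bootstrapped]
        non_boot = np.flatnonzero(~bootstrapped)
        if non_boot.size > 0:
            # Give all remaining probability mass to the best
            # sufficiently observed action under the current Q.
            best = non_boot[np.argmax(Q[s, non_boot])]
            pi[s, best] = pi_b[s, non_boot].sum()
        # If every action is bootstrapped, the row already equals pi_b.
    return pi
```

The projection keeps the baseline's behaviour wherever the batch provides too little evidence and acts greedily only where the estimates are trustworthy, which is the mechanism behind the paper's safety guarantee.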
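
The experiment-setup quote also refers to RaMDP (Petrik et al., 2016), which the paper uses as a baseline. A hedged sketch of its reward adjustment follows, assuming the commonly reported κadj/√N_D(s, a) penalty on rarely observed pairs; names are illustrative and the exact form should be checked against Petrik et al.

```python
import numpy as np

def ramdp_adjusted_rewards(rewards, counts, kappa_adj=0.003):
    """Sketch of the Reward-adjusted MDP (RaMDP) penalty.

    rewards   : (n_states, n_actions) reward estimates from the batch
    counts    : (n_states, n_actions) state-action counts N_D(s, a)
    kappa_adj : penalty coefficient (0.003 is the grid-search optimum quoted above)
    """
    # Penalise rarely observed pairs; unvisited pairs get an infinite
    # penalty, so a planner on the adjusted MDP avoids them entirely.
    with np.errstate(divide="ignore"):
        penalty = kappa_adj / np.sqrt(counts)
    return rewards - penalty
```

The adjusted rewards would then replace the raw estimates in an otherwise standard dynamic-programming solver before the resulting policy is evaluated on the true MDP.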