Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Safe Policy Improvement with Baseline Bootstrapping
Authors: Romain Laroche, Paul Trichelair, Rémi Tachet des Combes
ICML 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, in Section 3, we motivate our approach on a small stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB compared to existing algorithms, not only in safety but also in mean performance. Furthermore, we apply the model-free version to a continuous navigation task. |
| Researcher Affiliation | Industry | Romain Laroche 1 Paul Trichelair 1 Rémi Tachet des Combes 1 1Microsoft Research, Montréal, Canada. |
| Pseudocode | Yes | Algorithm 1 Greedy projection of Q(i) on Πb |
| Open Source Code | Yes | The code may be found at https://github.com/RomainLaroche/SPIBB and https://github.com/rems75/SPIBB-DQN. |
| Open Datasets | No | The paper describes custom experimental setups like a 'stochastic gridworld domain', 'randomly generated MDPs', and a 'helicopter navigation task' but does not provide access information (links, DOIs, formal citations) for publicly available datasets. |
| Dataset Splits | No | The paper describes a batch reinforcement learning setup where policies are trained on a fixed dataset and evaluated, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow and Keras but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We start by analysing the sensitivity of Πb-SPIBB and Π≤b-SPIBB with respect to N. We visually represent the results as two 1%-CVaR heatmaps: Figures 1(b) and 1(c) for Πb-SPIBB and Π≤b-SPIBB. ... In addition to the SPIBB algorithms, our finite MDP benchmark contains four algorithms: Basic RL, HCPI (Thomas et al., 2015a), Robust MDP, and RaMDP (Petrik et al., 2016). RaMDP stands for Reward-adjusted MDP and applies an exploration penalty when performing actions rarely observed in the dataset. With the exception of Basic RL, they all rely on one hyper-parameter: δhcpi, δrob and κadj respectively. We performed a grid search on those parameters and for HCPI compared 3 versions. In the main text, we only report the best performance we found (δhcpi = 0.9, δrob = 0.1, and κadj = 0.003); the full results can be found in Appendix B.2. ... To train our algorithms, we use a discount factor γ = 0.9, but we report in our results the undiscounted final reward. |