Safe Policy Improvement with Baseline Bootstrapping
Authors: Romain Laroche, Paul Trichelair, Remi Tachet des Combes
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, in Section 3, we motivate our approach on a small stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB compared to existing algorithms, not only in safety but also in mean performance. Furthermore, we apply the model-free version to a continuous navigation task. |
| Researcher Affiliation | Industry | Romain Laroche 1, Paul Trichelair 1, Remi Tachet des Combes 1; 1 Microsoft Research, Montréal, Canada. |
| Pseudocode | Yes | Algorithm 1 Greedy projection of Q^(i) on Πb (a hedged sketch of this projection step is given below the table) |
| Open Source Code | Yes | The code may be found at https://github.com/RomainLaroche/SPIBB and https://github.com/rems75/SPIBB-DQN. |
| Open Datasets | No | The paper describes custom experimental setups like a 'stochastic gridworld domain', 'randomly generated MDPs', and a 'helicopter navigation task' but does not provide access information (links, DOIs, formal citations) for publicly available datasets. |
| Dataset Splits | No | The paper describes a batch reinforcement learning setup where policies are trained on a fixed dataset and evaluated, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow and Keras but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We start by analysing the sensitivity of Πb-SPIBB and Π≤b-SPIBB with respect to N∧. We visually represent the results as two 1%-CVaR heatmaps: Figures 1(b) and 1(c) for Πb-SPIBB and Π≤b-SPIBB. ... In addition to the SPIBB algorithms, our finite MDP benchmark contains four algorithms: Basic RL, HCPI (Thomas et al., 2015a), Robust MDP, and RaMDP (Petrik et al., 2016). RaMDP stands for Reward-adjusted MDP and applies an exploration penalty when performing actions rarely observed in the dataset. At the exception of Basic RL, they all rely on one hyper-parameter: δhcpi, δrob and κadj respectively. We performed a grid search on those parameters and for HCPI compared 3 versions. In the main text, we only report the best performance we found (δhcpi = 0.9, δrob = 0.1, and κadj = 0.003); the full results can be found in Appendix B.2. ... To train our algorithms, we use a discount factor γ = 0.9, but we report in our results the undiscounted final reward. (A sketch of the RaMDP reward penalty mentioned here is given below the table.) |
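
The Pseudocode row above quotes Algorithm 1, the greedy projection of Q^(i) onto Πb. As a reading aid, here is a minimal Python sketch of that projection step, assuming tabular arrays for the value estimates, the baseline policy, and the dataset counts. The function and argument names (including the `n_wedge` threshold) are illustrative and are not the authors' code; their reference implementation lives in the repositories linked above.

```python
import numpy as np

def spibb_greedy_projection(Q, pi_b, counts, n_wedge):
    """Sketch of the greedy projection of Q onto the set Pi_b (Algorithm 1).

    Q       : (n_states, n_actions) current state-action value estimates Q^(i)
    pi_b    : (n_states, n_actions) baseline policy probabilities
    counts  : (n_states, n_actions) state-action counts N_D(s, a) in the batch
    n_wedge : count threshold below which a pair is considered "bootstrapped"
    """
    pi = np.zeros_like(pi_b)
    n_states = Q.shape[0]
    for s in range(n_states):
        bootstrapped = counts[s] < n_wedge       # rarely observed pairs
        # Keep the baseline probabilities on bootstrapped actions.
        pi[s, bootstrapped] = pi_b[s, bootstrapped]
        non_boot = np.flatnonzero(~bootstrapped)
        if non_boot.size > 0:
            # Give all remaining probability mass to the best
            # sufficiently observed action under the current Q.
            best = non_boot[np.argmax(Q[s, non_boot])]
            pi[s, best] = pi_b[s, non_boot].sum()
        # If every action is bootstrapped, the row already equals pi_b.
    return pi
```

The projection keeps the baseline's behaviour wherever the batch provides too little evidence and acts greedily only where the estimates are trustworthy, which is the mechanism behind the paper's safety guarantee.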
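
The experiment-setup quote also refers to RaMDP (Petrik et al., 2016), which the paper uses as a baseline. A hedged sketch of its reward adjustment follows, assuming the commonly reported κadj/√N_D(s, a) penalty on rarely observed pairs; names are illustrative and the exact form should be checked against Petrik et al.

```python
import numpy as np

def ramdp_adjusted_rewards(rewards, counts, kappa_adj=0.003):
    """Sketch of the Reward-adjusted MDP (RaMDP) penalty.

    rewards   : (n_states, n_actions) reward estimates from the batch
    counts    : (n_states, n_actions) state-action counts N_D(s, a)
    kappa_adj : penalty coefficient (0.003 is the grid-search optimum quoted above)
    """
    # Penalise rarely observed pairs; unvisited pairs get an infinite
    # penalty, so a planner on the adjusted MDP avoids them entirely.
    with np.errstate(divide="ignore"):
        penalty = kappa_adj / np.sqrt(counts)
    return rewards - penalty
```

The adjusted rewards would then replace the raw estimates in an otherwise standard dynamic-programming solver before the resulting policy is evaluated on the true MDP.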