Scalable Safe Policy Improvement via Monte Carlo Tree Search

Authors: Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, Matthijs T. J. Spaan

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Verona, Verona, Italy; (2) Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland; (3) Department of Software Science, Radboud University, Nijmegen, Netherlands; (4) Department of Software Technology, Delft University of Technology, Delft, Netherlands.
Pseudocode | Yes | Algorithms 1-4 show the pseudocode of MCTS-SPIBB.
Open Source Code | Yes | The original code of SPIBB (https://github.com/RomainLaroche/SPIBB) and our code of MCTS-SPIBB (https://github.com/Isla-lab/mctsspibb) are publicly available.
Open Datasets | Yes | We also empirically evaluate the proposed algorithm on three domains, i.e., Gridworld (Russell & Norvig, 2020), SysAdmin (Guestrin et al., 2003) and Wet Chicken (Scholl et al., 2022b).
Dataset Splits | No | The paper describes generating datasets of trajectories and using them to compute MLE transition models and state-action pair counts. It does not specify conventional train/validation/test splits (e.g., percentages or sample counts) of the kind needed to reproduce data partitioning for machine learning models.
Hardware Specification | Yes | Experiments were performed on a laptop with an 11th Gen Intel(R) Core(TM) i7-1165G7, 2.80 GHz with 10 GB RAM.
Software Dependencies | No | The paper mentions that code is available but does not list specific software dependencies with version numbers (e.g., Python, libraries, frameworks).
Experiment Setup | Yes | Then, for each dataset, we compute the MLE transition model $\hat{T}_D$, the state-action pair count matrix $N_D(s, a)$ and the bootstrapped/non-bootstrapped action sets $B_A(s)$/$\bar{B}_A(s)$ using threshold $N_\wedge = 5$ for Gridworld (average % of safe actions is $\lvert \bar{B}_A(s) \rvert / \lvert A \rvert \cdot 100 = 81\%$) and $N_\wedge = 50$ for SysAdmin (avg % of safe actions: 13.4%). Finally, for each dataset we generate the improved policy with both SPIBB and MCTS-SPIBB and compute the absolute difference between their values in the initial state $s_0$, that is $\Delta V_{s_0} = \lvert V^{\pi}_{M}(s_0) - V^{\pi_{spibb}}_{M}(s_0) \rvert$ (notice that in this test we compute the entire policy (all states) also with MCTS-SPIBB, and evaluate it using policy evaluation). Figure 2 shows the value of $\Delta V_{s_0}$ (y-axis) for each domain (Fig. 2.a for Gridworld and Fig. 2.b for SysAdmin) and for each dataset (each point is a dataset) with m = 100, 1000, 10000 simulations (x-axis).
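
The quantities named in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the repositories linked above). It assumes a small tabular MDP with integer state and action indices, and it follows the usual SPIBB convention that a state-action pair is non-bootstrapped when its count reaches the threshold. The function names, the `(s, a, s')` trajectory format, the reward matrix and the discount factor `gamma` are all hypothetical choices made for illustration.

```python
import numpy as np


def dataset_statistics(trajectories, n_states, n_actions, n_wedge):
    """Return the MLE transition model, the counts N_D(s, a) and a mask of
    non-bootstrapped (s, a) pairs, i.e. pairs with N_D(s, a) >= n_wedge."""
    counts_sas = np.zeros((n_states, n_actions, n_states))
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            counts_sas[s, a, s_next] += 1
    counts_sa = counts_sas.sum(axis=2)  # N_D(s, a)
    # MLE estimate N_D(s, a, s') / N_D(s, a); rows of unseen pairs stay zero.
    seen = counts_sa[:, :, None] > 0
    t_mle = np.divide(counts_sas, counts_sa[:, :, None],
                      out=np.zeros_like(counts_sas), where=seen)
    non_bootstrapped = counts_sa >= n_wedge
    return t_mle, counts_sa, non_bootstrapped


def policy_value(transitions, rewards, policy, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    p_pi = np.einsum("sa,sat->st", policy, transitions)  # state-to-state kernel
    r_pi = np.einsum("sa,sa->s", policy, rewards)        # expected reward
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * p_pi, r_pi)


def value_gap(transitions, rewards, policy_a, policy_b, gamma, s0=0):
    """Absolute difference between two policies' values in state s0."""
    v_a = policy_value(transitions, rewards, policy_a, gamma)
    v_b = policy_value(transitions, rewards, policy_b, gamma)
    return abs(v_a[s0] - v_b[s0])


# Toy dataset (hypothetical): two trajectories of (s, a, s') triples.
trajectories = [[(0, 0, 1), (1, 1, 0), (0, 0, 1)], [(0, 0, 0), (0, 1, 1)]]
t_mle, counts, safe = dataset_statistics(trajectories, 2, 2, n_wedge=2)
# Average percentage of non-bootstrapped actions over all states.
safe_pct = 100 * safe.mean()
```

Here `value_gap` would be called with the two improved policies (one from SPIBB, one from MCTS-SPIBB) as `(n_states, n_actions)` probability matrices. Solving the linear system is only practical for state spaces small enough to enumerate, which is also the regime in which the row above says the full MCTS-SPIBB policy was computed.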