Scalable Safe Policy Improvement via Monte Carlo Tree Search
Authors: Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, Matthijs T. J. Spaan
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Verona, Verona, Italy; (2) Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland; (3) Department of Software Science, Radboud University, Nijmegen, Netherlands; (4) Department of Software Technology, Delft University of Technology, Delft, Netherlands. |
| Pseudocode | Yes | Algorithms 1-4 show the pseudocode of MCTS-SPIBB. |
| Open Source Code | Yes | The original code of SPIBB (https://github.com/RomainLaroche/SPIBB) and our code of MCTS-SPIBB (https://github.com/Isla-lab/mctsspibb) are publicly available. |
| Open Datasets | Yes | We also empirically evaluate the proposed algorithm on three domains, i.e., Gridworld (Russell & Norvig, 2020), SysAdmin (Guestrin et al., 2003) and Wet Chicken (Scholl et al., 2022b). |
| Dataset Splits | No | The paper describes generating datasets of trajectories and using them to compute MLE transition models and state-action pair counts. It does not specify conventional train/validation/test dataset splits (e.g., percentages or sample counts) for model training or evaluation in the way typically required to reproduce data partitioning for machine learning models. |
| Hardware Specification | Yes | Experiments were performed on a laptop with an 11th Gen Intel(R) Core(TM) i7-1165G7, 2.80 GHz with 10 GB RAM. |
| Software Dependencies | No | The paper mentions that code is available but does not provide specific software dependencies with version numbers (e.g., Python, libraries, frameworks). |
| Experiment Setup | Yes | Then, for each dataset, we compute the MLE transition model $\hat{T}_D$, the state-action pair count matrix $N_D(s, a)$ and the bootstrapped/non-bootstrapped action sets $B_A(s)$/$\overline{B}_A(s)$ using threshold $N = 5$ for Gridworld (the average percentage of safe actions is $\lvert \overline{B}_A(s) \rvert / \lvert A \rvert \cdot 100 = 81\%$) and $N = 50$ for SysAdmin (average percentage of safe actions: 13.4%). Finally, for each dataset we generate the improved policy with both SPIBB and MCTS-SPIBB and compute the absolute difference between their values in the initial state $s_0$, that is $V_{s_0} = \lvert V^{\pi}_{M}(s_0) - V^{\pi_{spibb}}_{M}(s_0) \rvert$ (notice that in this test we compute the entire policy (all states) also with MCTS-SPIBB, and evaluate it using policy evaluation). Figure 2 shows the value of $V_{s_0}$ (y-axis) for each domain (Fig. 2.a for Gridworld and Fig. 2.b for SysAdmin) and for each dataset (each point is a dataset) with m = 100, 1000, 10000 simulations (x-axis). |
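
The quantities named in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes tabular states and actions indexed by integers, trajectories stored as lists of `(s, a, s_next)` tuples, a reward table `reward[s, a]`, and the usual SPIBB convention that a state-action pair is non-bootstrapped when its count is at least the threshold. All function names are hypothetical.

```python
import numpy as np


def estimate_model(trajectories, n_states, n_actions):
    """Return the count matrix N_D[s, a] and the MLE transition model T_hat[s, a, s']."""
    sas_counts = np.zeros((n_states, n_actions, n_states))
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            sas_counts[s, a, s_next] += 1
    counts = sas_counts.sum(axis=2)
    # MLE: normalise transition counts; unvisited (s, a) pairs keep an all-zero row.
    t_hat = np.divide(
        sas_counts,
        counts[:, :, None],
        out=np.zeros_like(sas_counts),
        where=counts[:, :, None] > 0,
    )
    return counts, t_hat


def non_bootstrapped_mask(counts, threshold):
    """Boolean mask that is True where N_D(s, a) >= threshold (assumed convention)."""
    return counts >= threshold


def safe_action_percentage(mask):
    """Average over states of |non-bootstrapped actions| / |A|, as a percentage."""
    return 100.0 * mask.mean()


def policy_value(policy, transition, reward, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.

    policy[s, a] is the probability of action a in state s,
    transition[s, a, s'] a transition model, reward[s, a] an assumed reward table.
    """
    n_states = transition.shape[0]
    p_pi = np.einsum("sa,sat->st", policy, transition)
    r_pi = np.einsum("sa,sa->s", policy, reward)
    return np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)


def initial_state_gap(policy_a, policy_b, transition, reward, gamma, s0=0):
    """Absolute difference between the values of two policies in state s0."""
    v_a = policy_value(policy_a, transition, reward, gamma)
    v_b = policy_value(policy_b, transition, reward, gamma)
    return abs(v_a[s0] - v_b[s0])
```

Under these assumptions, `non_bootstrapped_mask(counts, 5)` would correspond to the Gridworld threshold and `non_bootstrapped_mask(counts, 50)` to the SysAdmin one, while `initial_state_gap` mirrors the comparison between the MCTS-SPIBB and SPIBB policies in the initial state.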