reproducibilityindex.ai

Scalable Safe Policy Improvement via Monte Carlo Tree Search

Authors: Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, Matthijs T. J. Spaan

ICML 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
Researcher Affiliation	Academia	¹ Department of Computer Science, University of Verona, Verona, Italy; ² Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland; ³ Department of Software Science, Radboud University, Nijmegen, Netherlands; ⁴ Department of Software Technology, Delft University of Technology, Delft, Netherlands.
Pseudocode	Yes	Algorithms 1-4 show the pseudocode of MCTS-SPIBB.
Open Source Code	Yes	The original code of SPIBB¹ and our code of MCTS-SPIBB² are publicly available. ¹ https://github.com/RomainLaroche/SPIBB ² https://github.com/Isla-lab/mctsspibb
Open Datasets	Yes	We also empirically evaluate the proposed algorithm on three domains, i.e., Grid World (Russell & Norvig, 2020), Sys Admin (Guestrin et al., 2003) and Wet Chicken (Scholl et al., 2022b)
Dataset Splits	No	The paper describes generating datasets of trajectories and using them to compute MLE transition models and state-action pair counts. It does not specify conventional train/validation/test dataset splits (e.g., percentages or sample counts) for model training or evaluation in the way typically required to reproduce data partitioning for machine learning models.
Hardware Specification	Yes	Experiments were performed on a laptop with an 11th Gen Intel(R) Core(TM) i7-1165G7, 2.80 GHz with 10 GB RAM.
Software Dependencies	No	The paper mentions that code is available but does not provide specific software dependencies with version numbers (e.g., Python, libraries, frameworks).
Experiment Setup	Yes	Then, for each dataset, we compute the MLE transition model $T_D$, the state-action pair count matrix $N_D(s, a)$ and the bootstrapped/non-bootstrapped action sets $B_A(s)$/$\bar{B}_A(s)$ using threshold $N = 5$ for Gridworld (average % of safe actions is $\lvert \bar{B}_A(s) \rvert / \lvert A \rvert \cdot 100 = 81\%$) and $N = 50$ for Sys Admin (avg % of safe actions: 13.4%). Finally, for each dataset we generate the improved policy with both SPIBB and MCTS-SPIBB and compute the absolute difference between their values in the initial state $s_0$, that is $V_{s_0} = \lvert V^{\pi}_{M}(s_0) - V^{\pi_{spibb}}_{M}(s_0) \rvert$ (notice that in this test we compute the entire policy (all states) also with MCTS-SPIBB, and evaluate it using policy evaluation). Figure 2 shows the value of $V_{s_0}$ (y-axis) for each domain (Fig. 2.a for Gridworld and Fig. 2.b for Sys Admin) and for each dataset (each point is a dataset) with m = 100, 1000, 10000 simulations (x-axis).
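
The dataset preprocessing described in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch and not the authors' code: the function names `estimate_model` and `percent_safe_actions`, the `(state, action, next_state)` tuple layout, and the integer action encoding are all hypothetical. It also assumes that a state-action pair is bootstrapped when its count is strictly below the threshold, and that "safe" actions are the non-bootstrapped ones, averaged over states that appear in the dataset.

```python
from collections import defaultdict


def estimate_model(dataset, n_actions, threshold):
    """Build counts, an MLE transition model, and per-state action sets.

    dataset: iterable of (state, action, next_state) tuples (assumed layout).
    threshold: count below which a pair is treated as bootstrapped (assumption).
    """
    counts = defaultdict(int)                            # N_D(s, a)
    next_counts = defaultdict(lambda: defaultdict(int))  # N_D(s, a, s')
    for s, a, s_next in dataset:
        counts[(s, a)] += 1
        next_counts[(s, a)][s_next] += 1

    # MLE transition model: T_D(s' | s, a) = N_D(s, a, s') / N_D(s, a)
    t_mle = {
        sa: {s_next: c / counts[sa] for s_next, c in succ.items()}
        for sa, succ in next_counts.items()
    }

    # Only states observed as a transition source are considered here.
    states = {s for s, _ in counts}
    bootstrapped, non_bootstrapped = {}, {}
    for s in states:
        # .get avoids inserting new keys into the defaultdict
        bootstrapped[s] = [
            a for a in range(n_actions) if counts.get((s, a), 0) < threshold
        ]
        non_bootstrapped[s] = [
            a for a in range(n_actions) if counts.get((s, a), 0) >= threshold
        ]
    return counts, t_mle, bootstrapped, non_bootstrapped


def percent_safe_actions(non_bootstrapped, n_actions):
    """Average over states of |non-bootstrapped(s)| / |A| * 100."""
    ratios = [len(acts) / n_actions for acts in non_bootstrapped.values()]
    return 100.0 * sum(ratios) / len(ratios)
```

For a toy dataset with two states and two actions, calling `estimate_model(data, n_actions=2, threshold=5)` and passing the returned non-bootstrapped sets to `percent_safe_actions` gives the kind of percentage quoted in the row. The 81% and 13.4% figures themselves come from the paper's datasets and are not reproduced by this sketch.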