Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach
Authors: Alberto Maria Metelli, Matteo Pirotta, Daniele Calandriello, Marcello Restelli
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared on some chain-walk domains, the prison domain, and on the Blackjack card game. The main contributions of this paper are theoretical, algorithmic, and experimental, consisting of: ... 4. an empirical evaluation and comparison of the proposed algorithms with related approaches (Section 6). |
| Researcher Affiliation | Collaboration | Alberto Maria Metelli, DEIB, Politecnico di Milano, Milano, Italy; Matteo Pirotta, Facebook AI Research, Paris, France; Daniele Calandriello, Istituto Italiano di Tecnologia, Genova, Italy; Marcello Restelli, DEIB, Politecnico di Milano, Milano, Italy |
| Pseudocode | Yes | The pseudocode of USPI is reported in Algorithm 1. Algorithm 1: Exact USPI; Algorithm 2: Greedy Policy Chooser (GPC); Algorithm 3: Exact SSPI; Algorithm 4: Computing Υ (Forward Bound Optimizer, FBO); Algorithm 5: Computing the jump points for SSPI (Find Jump Point, FJP); Algorithm 6: Exact SASPI; Algorithm 7: Computing the jump points for SASPI (FJP); Algorithm 8: Approximate Greedy Policy Chooser; Algorithm 9: Approximate USPI; Algorithm 10: Computing the jump points for approximate SSPI (FJP); Algorithm 11: Computing the jump points for approximate SASPI (FJP); Algorithm 12: Exact GSPI; Algorithm 13: Computing the jump points for GSPI (FJP); Algorithm 14: d^π Sampling Procedure; Algorithm 15: GPC and approximate USPI Sampling Procedure; Algorithm 16: Approximate SSPI Sampling Procedure; Algorithm 17: Approximate SASPI Sampling Procedure. |
| Open Source Code | No | The paper does not provide concrete access to source code. There is no explicit statement about code release, nor a link to a code repository or mention of code in supplementary materials. |
| Open Datasets | Yes | The proposed algorithms are empirically evaluated and compared on some chain-walk domains, the prison domain, and on the Blackjack card game. We have chosen the chain walk problem (Lagoudakis and Parr, 2003a) for its simplicity, which makes the comparison with other approaches straightforward and particularly instructional. This variant (Azar et al., 2012) is an M² grid world surrounded by absorbing (wall) states. Blackjack is a card game where the player seeks to beat the dealer by obtaining a total score greater than the dealer's without exceeding 21 (refer to Dutech et al. (2005) for more details). |
| Dataset Splits | No | The paper describes reinforcement learning environments/domains (chain-walk, prison, Blackjack) rather than traditional datasets with explicit train/test/validation splits. There is no mention of specific dataset partitioning, percentages, or sample counts for different splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or specific cloud computing configurations used for running the experiments. |
| Software Dependencies | No | The paper does not list specific ancillary software details, such as library names with version numbers or specific solvers used. |
| Experiment Setup | Yes | All the algorithms were initialized with the policy uniform over actions. Each algorithm was run with two different values of the approximation error ϵ, i.e., 0.05 and 0.1, and with an estimation probability δ = 0.1. Table 2: Results in the approximate setting for the 4-state chain walk, averaged over 10 runs for all the algorithms. Initial policies were stochastic policies chosen at random. For all algorithms, we used N = 1000 samples and a horizon of T = 20 for estimating the value function. For approximate SSPI, we used 1000 samples with horizon 20 for estimating the state advantage function and 100 samples for estimating the derivative of the bound to be optimized. More specifically, we employed 200 samples with horizon 50 for the policy chooser, 50 samples with horizon 50 for estimating the state advantage function and the Q-function for approximate SSPI and approximate SASPI, respectively, and 50 samples for computing the bound derivative. Figure 13 reports the performance of the policies obtained by the approximate PI, CPI, USPI, SSPI, and SASPI algorithms using 1000 samples with horizon 100 for the policy chooser, 100 samples with horizon 100 for estimating the state advantage function and the Q-function for approximate SSPI and approximate SASPI, respectively, and 100 samples for computing the bound derivative. |
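The setup above (a uniform initial policy, conservatively updated toward a greedy target) can be sketched with a minimal convex-combination policy update in the style of CPI/USPI-family methods. This is an illustrative sketch only, not the paper's algorithms: the names (`n_states`, `n_actions`, `alpha`) and the fixed mixing coefficient are assumptions; in the actual algorithms the coefficient is derived from an improvement bound.

```python
import numpy as np

# Hypothetical sizes: e.g., the 4-state chain walk with two actions.
n_states, n_actions = 4, 2

rng = np.random.default_rng(0)

# Uniform initial policy, as in the experiments described above.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Stand-in greedy target policy built from a random "advantage" estimate
# (in the real algorithms this comes from the Greedy Policy Chooser).
A = rng.standard_normal((n_states, n_actions))
pi_target = np.zeros_like(pi)
pi_target[np.arange(n_states), A.argmax(axis=1)] = 1.0

# Safe update: mix current and target policies; alpha is fixed here for
# illustration, whereas USPI-style methods optimize it from a bound.
alpha = 0.3
pi_new = (1.0 - alpha) * pi + alpha * pi_target

# The result is still a valid stochastic policy.
assert np.allclose(pi_new.sum(axis=1), 1.0)
```

The convex combination guarantees the update stays a valid policy for any alpha in [0, 1], which is what makes per-step improvement bounds of this kind tractable.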