Learning Optimal Deterministic Policies with Stochastic Policy Gradients
Authors: Alessandro Montenegro, Marco Mussi, Alberto Maria Metelli, Matteo Papini
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we quantitatively compare action-based and parameter-based exploration, giving a formal guise to intuitive results. We also elaborate on how the assumptions used in the convergence analysis can be reconnected to the basic characteristics of the MDP and the policy classes. We conclude with a numerical validation to empirically illustrate the discussed trade-offs. |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy. Correspondence to: Alessandro Montenegro <alessandro.montenegro@polimi.it>. |
| Pseudocode | Yes | In this section we report the algorithm PGPE as it is reported in its original paper (Sehnke et al., 2010). In particular, we show the pseudo-code (Algorithm 1) of its original basic version... As done for PGPE, here we report the algorithm GPOMDP in its original version (Baxter & Bartlett, 2001; Peters & Schaal, 2006). We show the pseudo-code (Algorithm 2) of such original basic version... (A hedged PGPE sketch is given after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/MontenegroAlessandro/MagicRL. |
| Open Datasets | Yes | We run PGPE and GPOMDP for K = 2000 iterations with batch size N = 100 on three environments from the MuJoCo (Todorov et al., 2012) suite: Swimmer-v4 (T = 200), Hopper-v4 (T = 100), and HalfCheetah-v4 (T = 100). |
| Dataset Splits | No | The paper describes experiments conducted within simulation environments where data is generated dynamically. It does not specify fixed training, validation, or test dataset splits in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' for setting the step size and 'MuJoCo (Todorov et al., 2012)' for environments, but it does not specify explicit version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | We run PGPE and GPOMDP for K = 2000 iterations with batch size N = 100 on three environments from the MuJoCo (Todorov et al., 2012) suite: Swimmer-v4 (T = 200), Hopper-v4 (T = 100), and HalfCheetah-v4 (T = 100). For all the environments the deterministic policy is linear in the state and the noise is Gaussian. We consider σ² ∈ {0.01, 0.1, 1, 10, 100}. More details in Appendix H.1. ... We employed Adam (Kingma & Ba, 2014) to set the step size with initial values 0.1 for PGPE and 0.01 for GPOMDP. (A hedged sketch of this configuration is given after the table.) |
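
For the Pseudocode row: the paper's appendix reproduces the original PGPE pseudo-code (Sehnke et al., 2010). The snippet below is a minimal, hedged sketch of one PGPE iteration in that spirit, not the authors' implementation; the function name `pgpe_iteration`, the `env_rollout` callback, and the single learning rate are illustrative assumptions.

```python
import numpy as np

def pgpe_iteration(env_rollout, mu, sigma, n_episodes=100, lr=0.1):
    """One PGPE iteration (sketch): sample deterministic-policy parameters from a
    Gaussian hyperpolicy N(mu, diag(sigma^2)), evaluate episodic returns, and update
    the hyperpolicy with a likelihood-ratio gradient using a mean-return baseline."""
    d = mu.shape[0]
    thetas = mu + sigma * np.random.randn(n_episodes, d)        # parameter-based exploration
    returns = np.array([env_rollout(theta) for theta in thetas])
    adv = returns - returns.mean()                               # baseline-corrected returns
    # Score functions of the Gaussian hyperpolicy w.r.t. its mean and standard deviation.
    grad_mu = ((thetas - mu) / sigma**2 * adv[:, None]).mean(axis=0)
    grad_sigma = (((thetas - mu) ** 2 - sigma**2) / sigma**3 * adv[:, None]).mean(axis=0)
    return mu + lr * grad_mu, sigma + lr * grad_sigma
```

GPOMDP, by contrast, perturbs the action at every step with a stochastic policy and accumulates per-step score functions; the two estimators are the parameter-based and action-based extremes that the paper compares.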
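
For the Experiment Setup row: a hedged sketch of the reported configuration (K = 2000 iterations, batch size N = 100, a policy linear in the state with Gaussian noise, the σ² grid, and the three MuJoCo environments with their horizons) might look as follows. The helper name `rollout_linear_policy`, the use of Gymnasium, and the seeding argument are assumptions for illustration, not the authors' code.

```python
import gymnasium as gym
import numpy as np

# Reported settings: K = 2000 iterations, batch size N = 100, linear deterministic
# policy, Gaussian noise with variance sigma^2 taken from the grid below.
ENVS = {"Swimmer-v4": 200, "Hopper-v4": 100, "HalfCheetah-v4": 100}  # env -> horizon T
SIGMA2_GRID = [0.01, 0.1, 1, 10, 100]
K, N = 2000, 100

def rollout_linear_policy(env_name, horizon, theta, sigma2=0.0, seed=None):
    """Run one episode with an action a = W s that is linear in the state s,
    optionally perturbed by Gaussian action noise (action-based exploration)."""
    env = gym.make(env_name)
    s, _ = env.reset(seed=seed)
    act_dim = env.action_space.shape[0]
    obs_dim = env.observation_space.shape[0]
    W = theta.reshape(act_dim, obs_dim)
    total_return = 0.0
    for _ in range(horizon):
        a = W @ s + np.sqrt(sigma2) * np.random.randn(act_dim)
        s, r, terminated, truncated, _ = env.step(a)
        total_return += r
        if terminated or truncated:
            break
    env.close()
    return total_return
```

With sigma2 = 0 this rollout evaluates the deterministic policy whose parameters PGPE samples from its hyperpolicy; with sigma2 > 0 it corresponds to the Gaussian action noise that GPOMDP explores with. As reported, Adam (initial step sizes 0.1 for PGPE and 0.01 for GPOMDP) would drive the parameter updates.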