Learning Optimal Deterministic Policies with Stochastic Policy Gradients
Authors: Alessandro Montenegro, Marco Mussi, Alberto Maria Metelli, Matteo Papini
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we quantitatively compare action-based and parameter-based exploration, giving a formal guise to intuitive results. We also elaborate on how the assumptions used in the convergence analysis can be reconnected to the basic characteristics of the MDP and the policy classes. We conclude with a numerical validation to empirically illustrate the discussed trade-offs. |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy. Correspondence to: Alessandro Montenegro <alessandro.montenegro@polimi.it>. |
| Pseudocode | Yes | In this section we report the algorithm PGPE as it is reported in its original paper (Sehnke et al., 2010). In particular, we show the pseudo-code (Algorithm 1) of its original basic version... As done for PGPE, here we report the algorithm GPOMDP in its original version (Baxter & Bartlett, 2001; Peters & Schaal, 2006). We show the pseudo-code (Algorithm 2) of such original basic version... (A hedged PGPE sketch is given after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/MontenegroAlessandro/MagicRL. |
| Open Datasets | Yes | We run PGPE and GPOMDP for K = 2000 iterations with batch size N = 100 on three environments from the MuJoCo (Todorov et al., 2012) suite: Swimmer-v4 (T = 200), Hopper-v4 (T = 100), and HalfCheetah-v4 (T = 100). |
| Dataset Splits | No | The paper describes experiments conducted within simulation environments where data is generated dynamically. It does not specify fixed training, validation, or test dataset splits in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' for setting the step size and 'MuJoCo (Todorov et al., 2012)' for environments, but it does not specify explicit version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | We run PGPE and GPOMDP for K = 2000 iterations with batch size N = 100 on three environments from the MuJoCo (Todorov et al., 2012) suite: Swimmer-v4 (T = 200), Hopper-v4 (T = 100), and HalfCheetah-v4 (T = 100). For all the environments the deterministic policy is linear in the state and the noise is Gaussian. We consider σ² ∈ {0.01, 0.1, 1, 10, 100}. More details in Appendix H.1. ... We employed Adam (Kingma & Ba, 2014) to set the step size with initial values 0.1 for PGPE and 0.01 for GPOMDP. (A hedged sketch of this configuration is given after the table.) |
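
For the Pseudocode row: the paper's appendix reproduces the original PGPE pseudo-code (Sehnke et al., 2010). The snippet below is a minimal, hedged sketch of one PGPE iteration in that spirit, not the authors' implementation; the function name `pgpe_iteration`, the `env_rollout` callback, and the single learning rate are illustrative assumptions.

```python
import numpy as np

def pgpe_iteration(env_rollout, mu, sigma, n_episodes=100, lr=0.1):
    """One PGPE iteration (sketch): sample deterministic-policy parameters from a
    Gaussian hyperpolicy N(mu, diag(sigma^2)), evaluate episodic returns, and update
    the hyperpolicy with a likelihood-ratio gradient using a mean-return baseline."""
    d = mu.shape[0]
    thetas = mu + sigma * np.random.randn(n_episodes, d)        # parameter-based exploration
    returns = np.array([env_rollout(theta) for theta in thetas])
    adv = returns - returns.mean()                               # baseline-corrected returns
    # Score functions of the Gaussian hyperpolicy w.r.t. its mean and standard deviation.
    grad_mu = ((thetas - mu) / sigma**2 * adv[:, None]).mean(axis=0)
    grad_sigma = (((thetas - mu) ** 2 - sigma**2) / sigma**3 * adv[:, None]).mean(axis=0)
    return mu + lr * grad_mu, sigma + lr * grad_sigma
```

GPOMDP, by contrast, perturbs the action at every step with a stochastic policy and accumulates per-step score functions; the two estimators are the parameter-based and action-based extremes that the paper compares.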
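
For the Experiment Setup row: a hedged sketch of the reported configuration (K = 2000 iterations, batch size N = 100, a policy linear in the state with Gaussian noise, the σ² grid, and the three MuJoCo environments with their horizons) might look as follows. The helper name `rollout_linear_policy`, the use of Gymnasium, and the seeding argument are assumptions for illustration, not the authors' code.

```python
import gymnasium as gym
import numpy as np

# Reported settings: K = 2000 iterations, batch size N = 100, linear deterministic
# policy, Gaussian noise with variance sigma^2 taken from the grid below.
ENVS = {"Swimmer-v4": 200, "Hopper-v4": 100, "HalfCheetah-v4": 100}  # env -> horizon T
SIGMA2_GRID = [0.01, 0.1, 1, 10, 100]
K, N = 2000, 100

def rollout_linear_policy(env_name, horizon, theta, sigma2=0.0, seed=None):
    """Run one episode with an action a = W s that is linear in the state s,
    optionally perturbed by Gaussian action noise (action-based exploration)."""
    env = gym.make(env_name)
    s, _ = env.reset(seed=seed)
    act_dim = env.action_space.shape[0]
    obs_dim = env.observation_space.shape[0]
    W = theta.reshape(act_dim, obs_dim)
    total_return = 0.0
    for _ in range(horizon):
        a = W @ s + np.sqrt(sigma2) * np.random.randn(act_dim)
        s, r, terminated, truncated, _ = env.step(a)
        total_return += r
        if terminated or truncated:
            break
    env.close()
    return total_return
```

With sigma2 = 0 this rollout evaluates the deterministic policy whose parameters PGPE samples from its hyperpolicy; with sigma2 > 0 it corresponds to the Gaussian action noise that GPOMDP explores with. As reported, Adam (initial step sizes 0.1 for PGPE and 0.01 for GPOMDP) would drive the parameter updates.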