Expected Policy Gradients
Authors: Kamil Ciosek, Shimon Whiteson
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains. ... While EPG has many potential uses, we focus on empirically evaluating one particular application: exploration driven by the Hessian exponential (as introduced in Algorithm 2 and Lemma 2), replacing the standard Ornstein-Uhlenbeck (OU) exploration in continuous action domains. |
| Researcher Affiliation | Academia | Kamil Ciosek, Shimon Whiteson Department of Computer Science, University of Oxford Wolfson Building, Parks Road, Oxford OX1 3QD {kamil.ciosek,shimon.whiteson}@cs.ox.ac.uk |
| Pseudocode | Yes | Algorithm 1 Expected Policy Gradients; Algorithm 2 Gaussian Policy Gradients; Algorithm 3 Gaussian Integrals |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described. |
| Open Datasets | Yes | To this end, we applied EPG to four domains modelled with the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012): HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1 and Walker2d-v1 |
| Dataset Splits | No | The paper uses continuous control environments and does not specify explicit training, validation, or test dataset splits in terms of percentages or counts, as it generates data dynamically through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like the 'MuJoCo physics simulator' and 'OpenAI Gym' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The exploration hyperparameters for EPG were σ0 = 0.2 and c = 1.0, where the exploration covariance is σ0e^{cH}. These values were obtained using a grid search from the set {0.2, 0.5, 1} for σ0 and {0.5, 1.0, 2.0} for c over the HalfCheetah-v1 domain. ... For SPG, we used OU exploration and a constant diagonal covariance of 0.2 in the actor update (this approximately corresponds to the average variance of the OU process over time). (Hedged sketches of both exploration schemes follow the table.) |
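
The Experiment Setup row quotes the exploration covariance σ0e^{cH}, where H is the Hessian of the critic with respect to the action. Below is a minimal sketch, not the authors' code, of how such a covariance could be formed and sampled from; the function name `exploration_covariance`, the placeholder Hessian, and the zero policy mean are assumptions introduced purely for illustration.

```python
# Sketch of the Hessian-exponential exploration covariance described in the
# Experiment Setup row: Sigma = sigma0 * e^{c * H}. The Hessian H is a
# hypothetical input here; in EPG it comes from a local approximation of the
# critic Q(s, a) around the policy's mean action.
import numpy as np
from scipy.linalg import expm


def exploration_covariance(hessian: np.ndarray,
                           sigma0: float = 0.2,
                           c: float = 1.0) -> np.ndarray:
    """Return sigma0 * e^{c * H} as a symmetric covariance matrix.

    sigma0 = 0.2 and c = 1.0 are the values reported in the table, found by a
    grid search over {0.2, 0.5, 1} x {0.5, 1.0, 2.0} on HalfCheetah-v1.
    """
    cov = sigma0 * expm(c * hessian)
    # Symmetrize to guard against numerical asymmetry before sampling.
    return 0.5 * (cov + cov.T)


# Usage: sample an exploratory action around the policy mean.
rng = np.random.default_rng(0)
action_dim = 6                       # e.g. HalfCheetah-v1 has a 6-D action space
H = -np.eye(action_dim)              # placeholder Hessian (assumed negative definite)
mu = np.zeros(action_dim)            # placeholder policy mean action
Sigma = exploration_covariance(H)
action = rng.multivariate_normal(mu, Sigma)
```

The matrix exponential keeps the covariance positive definite for any symmetric Hessian, which is what makes it usable directly as the covariance of a Gaussian exploration policy.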
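
For comparison, the baselines in the Research Type and Experiment Setup rows use Ornstein-Uhlenbeck (OU) exploration. The sketch below shows a generic OU noise process; the θ and dt values are standard DDPG-style defaults assumed for illustration and are not taken from the paper, which only notes that a constant diagonal covariance of 0.2 roughly matches the OU process's average variance.

```python
# Generic Ornstein-Uhlenbeck exploration noise (the baseline EPG replaces):
# x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, I).
import numpy as np


class OUNoise:
    """Temporally correlated exploration noise added to a deterministic action."""

    def __init__(self, action_dim: int, mu: float = 0.0,
                 theta: float = 0.15, sigma: float = 0.2,
                 dt: float = 1.0, seed: int = 0):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.copy(self.mu)

    def sample(self) -> np.ndarray:
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x


# Usage: perturb a (placeholder) deterministic policy action with OU noise.
noise = OUNoise(action_dim=6)
noisy_action = np.zeros(6) + noise.sample()
```

Unlike the Hessian-exponential covariance, OU noise is state-independent: its spread is fixed by σ rather than shaped by the local curvature of the critic.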