Expected Policy Gradients

Authors: Kamil Ciosek, Shimon Whiteson

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains. From the Experiments section: While EPG has many potential uses, we focus on empirically evaluating one particular application: exploration driven by the Hessian exponential (as introduced in Algorithm 2 and Lemma 2), replacing the standard Ornstein-Uhlenbeck (OU) exploration in continuous action domains.
Researcher Affiliation | Academia | Kamil Ciosek, Shimon Whiteson, Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, Oxford OX1 3QD, {kamil.ciosek,shimon.whiteson}@cs.ox.ac.uk
Pseudocode | Yes | Algorithm 1: Expected Policy Gradients; Algorithm 2: Gaussian Policy Gradients; Algorithm 3: Gaussian Integrals
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | To this end, we applied EPG to four domains modelled with the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012): HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1 and Walker2d-v1.
Dataset Splits | No | The paper uses continuous control environments and does not specify explicit training, validation, or test dataset splits in terms of percentages or counts; data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software such as the MuJoCo physics simulator and OpenAI Gym but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | The exploration hyperparameters for EPG were σ0 = 0.2 and c = 1.0, where the exploration covariance is σ0 e^(cH). These values were obtained using a grid search from the set {0.2, 0.5, 1} for σ0 and {0.5, 1.0, 2.0} for c over the HalfCheetah-v1 domain. ... For SPG, we used OU exploration and a constant diagonal covariance of 0.2 in the actor update (this approximately corresponds to the average variance of the OU process over time).
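
To make the Research Type and Experiment Setup rows concrete, here is a minimal sketch of the Hessian-exponential exploration covariance σ0 e^(cH), assuming the Hessian H of the critic Q(s, a) with respect to the action is already available (e.g., obtained by differentiating the critic twice with automatic differentiation). Function and variable names are illustrative and are not taken from the authors' code.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential


def exploration_covariance(hessian, sigma0=0.2, c=1.0):
    """Covariance sigma0 * exp(c * H), with H the Hessian of the critic
    Q(s, a) w.r.t. the action. sigma0 = 0.2 and c = 1.0 are the values
    reported in the paper's experiment setup."""
    H = np.asarray(hessian, dtype=float)
    H = 0.5 * (H + H.T)          # symmetrise against numerical asymmetry
    return sigma0 * expm(c * H)  # positive definite for any symmetric H


# Illustrative use with a made-up Hessian for a 2-D action space.
H = np.array([[-1.0, 0.2],
              [0.2, -0.5]])
cov = exploration_covariance(H)
policy_mean = np.zeros(2)        # stand-in for the policy mean mu(s)
noisy_action = np.random.multivariate_normal(policy_mean, cov)
```

Because the matrix exponential of a symmetric matrix is always positive definite, the resulting covariance is a valid Gaussian covariance regardless of the sign of the critic's curvature.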
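
The OU exploration that this covariance replaces in the DPG/SPG baselines is the standard Ornstein-Uhlenbeck action-noise process. The sketch below uses common DDPG-style parameters; theta and sigma here are assumptions for illustration, not values reported in the paper.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck action noise: dx = theta * (0 - x) dt + sigma * dW.
    theta and sigma are illustrative defaults, not taken from the paper."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def reset(self):
        self.x = np.zeros_like(self.x)

    def sample(self):
        self.x = (self.x
                  - self.theta * self.x * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x


noise = OUNoise(dim=6)         # e.g., HalfCheetah has a 6-D action space
perturbation = noise.sample()  # added to the deterministic policy's action
```

In the continuous-time limit the stationary variance of this process is sigma^2 / (2 * theta), which is the kind of "average variance over time" that the quoted setup matches with a constant diagonal covariance of 0.2 for SPG.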
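
The four evaluation domains in the Open Datasets row are standard MuJoCo tasks exposed through OpenAI Gym. The loop below is a minimal sketch for instantiating them; it assumes an older Gym release in which the listed -v1 IDs (copied verbatim from the paper) are registered, together with a working mujoco-py installation. Newer Gym/Gymnasium releases use different version suffixes and a different reset/step API.

```python
import gym  # assumes an older OpenAI Gym release where the -v1 MuJoCo IDs exist

# Environment IDs as listed in the paper.
DOMAINS = ["HalfCheetah-v1", "InvertedPendulum-v1", "Reacher2d-v1", "Walker2d-v1"]

for env_id in DOMAINS:
    env = gym.make(env_id)
    obs = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # random actions, just to exercise the env
        obs, reward, done, info = env.step(action)  # old 4-tuple Gym API
        episode_return += reward
    print(env_id, "random-policy return:", episode_return)
    env.close()
```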