Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Expected Policy Gradients

Authors: Kamil Ciosek, Shimon Whiteson

AAAI 2018 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains. Experiments While EPG has many potential uses, we focus on empirically evaluating one particular application: exploration driven by the Hessian exponential (as introduced in Algorithm 2 and Lemma 2), replacing the standard Ornstein-Uhlenbeck (OU) exploration in continuous action domains.
Researcher Affiliation | Academia | Kamil Ciosek, Shimon Whiteson Department of Computer Science, University of Oxford Wolfson Building, Parks Road, Oxford OX1 3QD EMAIL
Pseudocode | Yes | Algorithm 1 Expected Policy Gradients; Algorithm 2 Gaussian Policy Gradients; Algorithm 3 Gaussian Integrals
Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described.
Open Datasets | Yes | To this end, we applied EPG to four domains modelled with the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012): HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1 and Walker2d-v1
Dataset Splits | No | The paper uses continuous control environments and does not specify explicit training, validation, or test dataset splits in terms of percentages or counts, as it generates data dynamically through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software like the 'MuJoCo physics simulator' and 'OpenAI Gym' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The exploration hyperparameters for EPG were σ0 = 0.2 and c = 1.0 where the exploration covariance is σ0 e^{cH}. These values were obtained using a grid search from the set {0.2, 0.5, 1} for σ0 and {0.5, 1.0, 2.0} for c over the HalfCheetah-v1 domain. ... For SPG, we used OU exploration and a constant diagonal covariance of 0.2 in the actor update (this approximately corresponds to the average variance of the OU process over time).
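
The Experiment Setup row gives the exploration covariance as σ0 scaled by a matrix exponential of cH. A minimal sketch of that computation follows, assuming H is a symmetric matrix (for example, a Hessian of the critic with respect to the action, as the Research Type excerpt suggests). The function name `exploration_covariance`, the eigendecomposition route, and the grid constants are illustrative assumptions. The paper's code is not public, so this is not the authors' implementation.

```python
import numpy as np

# Grid values quoted in the Experiment Setup row (constant names are hypothetical).
SIGMA0_GRID = (0.2, 0.5, 1.0)
C_GRID = (0.5, 1.0, 2.0)


def exploration_covariance(hessian, sigma0=0.2, c=1.0):
    """Return sigma0 * expm(c * H) for a symmetric matrix H.

    Assumption: H is symmetric, so expm(c * H) = V diag(exp(c * lam)) V^T,
    where (lam, V) is the eigendecomposition of H.
    """
    h = np.asarray(hessian, dtype=float)
    h = 0.5 * (h + h.T)  # remove any small numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(h)
    # Scale each eigenvector column by exp(c * eigenvalue), then project back.
    return sigma0 * (eigvecs * np.exp(c * eigvals)) @ eigvecs.T


if __name__ == "__main__":
    # Hypothetical 2x2 Hessian, used only for illustration.
    H = np.array([[-1.0, 0.3], [0.3, -2.0]])
    print(exploration_covariance(H))
```

Because every eigenvalue of the matrix exponential is positive, the result is a valid (positive definite) covariance even when H has negative eigenvalues, and a zero Hessian reduces to the isotropic covariance σ0 I.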