Evolved Policy Gradients
Authors: Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, Jonathan Ho, Pieter Abbeel
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. |
| Researcher Affiliation | Collaboration | OpenAI, UC Berkeley, MIT |
| Pseudocode | Yes | Algorithm 1: Evolved Policy Gradients (EPG) |
| Open Source Code | Yes | An implementation of EPG is available at http://github.com/openai/EPG. |
| Open Datasets | Yes | We apply our method to several randomized continuous-control MuJoCo environments [1, 19, 4], namely Random Hopper and Random Walker (with randomized gravity, friction, body mass, and link thickness), Random Reacher (with randomized link lengths), Directional Hopper and Directional Half Cheetah (with randomized forward/backward reward function), Goal Ant (reward function based on the randomized target location), and Fetch (randomized target location). |
| Dataset Splits | No | The paper distinguishes 'metatraining' and 'test-time' tasks and describes 'inner loop' training of policies, but it does not report conventional train/validation/test splits, and no validation set is mentioned for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or types of computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as the MuJoCo environments and PPO [26] (used as environments or comparison methods), but it does not provide version numbers for any software dependencies. |
| Experiment Setup | Yes | The paper provides details such as 'for W inner-loop workers', 'U steps of experience', 'Every M steps the policy is updated through SGD', and 'anneal α from 1 to 0 over a finite number of outer-loop epochs'. Specific values are shown in figures, e.g., 128 (policy updates) × 64 (update frequency) = 8192 timesteps for Random Hopper in Figure 3. (A hedged sketch of this loop structure follows the table.) |
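
For orientation, the outer/inner loop structure referenced above (Algorithm 1 and the W/U/M/α setup) can be summarized in a short sketch. The toy environment, linear policy, quadratic "learned loss", and finite-difference inner update below are illustrative stand-ins chosen to keep the example self-contained; they are not the paper's implementation, which evolves a neural-network loss over trajectory buffers and trains MuJoCo policies with SGD (see the official code at http://github.com/openai/EPG).

```python
# Minimal sketch of the EPG outer/inner loop structure described in the table.
# Hyperparameter names (W, U, M, alpha) follow the quoted setup; everything
# else (toy env, linear policy, quadratic "learned loss") is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, ACT_DIM = 3, 1
W = 8              # inner-loop workers per outer-loop epoch
U = 128            # steps of experience per inner-loop worker
M = 16             # policy is updated through SGD every M steps
SIGMA, OUTER_LR, INNER_LR = 0.05, 0.01, 0.05
PHI_DIM = OBS_DIM + ACT_DIM + 1   # parameters of the toy learned loss


def env_step(obs, act):
    """Toy continuous-control stand-in: reward is higher when the action tracks obs[0]."""
    reward = -float((act[0] - obs[0]) ** 2)
    next_obs = rng.normal(size=OBS_DIM)
    return next_obs, reward


def policy_act(theta, obs):
    return theta @ obs  # deterministic linear policy, shape (ACT_DIM,)


def learned_loss(phi, obs, act):
    """Illustrative differentiable loss over (obs, act); EPG's actual loss is a neural net over experience buffers."""
    w_obs, w_act, bias = phi[:OBS_DIM], phi[OBS_DIM:OBS_DIM + ACT_DIM], phi[-1]
    return float((obs @ w_obs + act @ w_act + bias) ** 2)


def inner_loop(phi, alpha):
    """Train a fresh policy by SGD on the mixed loss; return its average reward."""
    theta = np.zeros((ACT_DIM, OBS_DIM))
    obs, total_reward = rng.normal(size=OBS_DIM), 0.0
    for step in range(1, U + 1):
        act = policy_act(theta, obs)
        obs_next, reward = env_step(obs, act)
        total_reward += reward
        if step % M == 0:
            # Every M steps: one SGD step. alpha weights an ordinary
            # policy-gradient-style surrogate (here: squared tracking error),
            # (1 - alpha) weights the evolved loss; alpha is annealed 1 -> 0
            # by the outer loop, as in the quoted setup.
            def mixed_objective(th):
                a = policy_act(th, obs)
                pg_surrogate = (a[0] - obs[0]) ** 2
                return alpha * pg_surrogate + (1 - alpha) * learned_loss(phi, obs, a)

            base = mixed_objective(theta)
            grad, eps = np.zeros_like(theta), 1e-4
            for i in range(ACT_DIM):
                for j in range(OBS_DIM):
                    theta_p = theta.copy()
                    theta_p[i, j] += eps
                    grad[i, j] = (mixed_objective(theta_p) - base) / eps
            theta -= INNER_LR * grad
        obs = obs_next
    return total_reward / U


# Outer loop: evolution strategies over the loss parameters phi, with alpha
# annealed from 1 to 0 over the outer-loop epochs.
phi = np.zeros(PHI_DIM)
EPOCHS = 20
for epoch in range(EPOCHS):
    alpha = 1.0 - epoch / (EPOCHS - 1)
    noise = rng.normal(size=(W, PHI_DIM))
    returns = np.array([inner_loop(phi + SIGMA * n, alpha) for n in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    phi += OUTER_LR / (W * SIGMA) * noise.T @ advantages   # ES gradient estimate
    print(f"epoch {epoch:2d}  alpha {alpha:.2f}  mean return {returns.mean():.3f}")
```

In the actual method, the inner-loop gradient is taken analytically through the learned loss network rather than by finite differences, and the outer loop runs many parallel workers across randomized environments; the sketch only mirrors the loop structure and the α annealing schedule quoted above.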