Evolved Policy Gradients
Authors: Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, Jonathan Ho, Pieter Abbeel
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. |
| Researcher Affiliation | Collaboration | OpenAI, UC Berkeley, MIT |
| Pseudocode | Yes | Algorithm 1: Evolved Policy Gradients (EPG) |
| Open Source Code | Yes | An implementation of EPG is available at http://github.com/openai/EPG. |
| Open Datasets | Yes | We apply our method to several randomized continuous-control MuJoCo environments [1, 19, 4], namely Random Hopper and Random Walker (with randomized gravity, friction, body mass, and link thickness), Random Reacher (with randomized link lengths), Directional Hopper and Directional Half Cheetah (with randomized forward/backward reward function), Goal Ant (reward function based on the randomized target location), and Fetch (randomized target location). |
| Dataset Splits | No | The paper distinguishes 'metatraining' and 'test-time' tasks and describes 'inner loop' training of policies, but it does not report conventional train/validation/test splits, and no validation set is mentioned for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or types of computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as the MuJoCo environments and PPO [26] (used as environments or comparison methods), but it does not provide version numbers for any software dependencies. |
| Experiment Setup | Yes | The paper provides details such as 'for W inner-loop workers', 'U steps of experience', 'Every M steps the policy is updated through SGD', and 'anneal α from 1 to 0 over a finite number of outer-loop epochs'. Specific values are shown in figures, e.g., 128 (policy updates) × 64 (update frequency) = 8192 timesteps for Random Hopper in Figure 3. (A hedged sketch of this loop structure follows the table.) |
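
For orientation, the outer/inner loop structure referenced above (Algorithm 1 and the W/U/M/α setup) can be summarized in a short sketch. The toy environment, linear policy, quadratic "learned loss", and finite-difference inner update below are illustrative stand-ins chosen to keep the example self-contained; they are not the paper's implementation, which evolves a neural-network loss over trajectory buffers and trains MuJoCo policies with SGD (see the official code at http://github.com/openai/EPG).

```python
# Minimal sketch of the EPG outer/inner loop structure described in the table.
# Hyperparameter names (W, U, M, alpha) follow the quoted setup; everything
# else (toy env, linear policy, quadratic "learned loss") is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, ACT_DIM = 3, 1
W = 8              # inner-loop workers per outer-loop epoch
U = 128            # steps of experience per inner-loop worker
M = 16             # policy is updated through SGD every M steps
SIGMA, OUTER_LR, INNER_LR = 0.05, 0.01, 0.05
PHI_DIM = OBS_DIM + ACT_DIM + 1   # parameters of the toy learned loss


def env_step(obs, act):
    """Toy continuous-control stand-in: reward is higher when the action tracks obs[0]."""
    reward = -float((act[0] - obs[0]) ** 2)
    next_obs = rng.normal(size=OBS_DIM)
    return next_obs, reward


def policy_act(theta, obs):
    return theta @ obs  # deterministic linear policy, shape (ACT_DIM,)


def learned_loss(phi, obs, act):
    """Illustrative differentiable loss over (obs, act); EPG's actual loss is a neural net over experience buffers."""
    w_obs, w_act, bias = phi[:OBS_DIM], phi[OBS_DIM:OBS_DIM + ACT_DIM], phi[-1]
    return float((obs @ w_obs + act @ w_act + bias) ** 2)


def inner_loop(phi, alpha):
    """Train a fresh policy by SGD on the mixed loss; return its average reward."""
    theta = np.zeros((ACT_DIM, OBS_DIM))
    obs, total_reward = rng.normal(size=OBS_DIM), 0.0
    for step in range(1, U + 1):
        act = policy_act(theta, obs)
        obs_next, reward = env_step(obs, act)
        total_reward += reward
        if step % M == 0:
            # Every M steps: one SGD step. alpha weights an ordinary
            # policy-gradient-style surrogate (here: squared tracking error),
            # (1 - alpha) weights the evolved loss; alpha is annealed 1 -> 0
            # by the outer loop, as in the quoted setup.
            def mixed_objective(th):
                a = policy_act(th, obs)
                pg_surrogate = (a[0] - obs[0]) ** 2
                return alpha * pg_surrogate + (1 - alpha) * learned_loss(phi, obs, a)

            base = mixed_objective(theta)
            grad, eps = np.zeros_like(theta), 1e-4
            for i in range(ACT_DIM):
                for j in range(OBS_DIM):
                    theta_p = theta.copy()
                    theta_p[i, j] += eps
                    grad[i, j] = (mixed_objective(theta_p) - base) / eps
            theta -= INNER_LR * grad
        obs = obs_next
    return total_reward / U


# Outer loop: evolution strategies over the loss parameters phi, with alpha
# annealed from 1 to 0 over the outer-loop epochs.
phi = np.zeros(PHI_DIM)
EPOCHS = 20
for epoch in range(EPOCHS):
    alpha = 1.0 - epoch / (EPOCHS - 1)
    noise = rng.normal(size=(W, PHI_DIM))
    returns = np.array([inner_loop(phi + SIGMA * n, alpha) for n in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    phi += OUTER_LR / (W * SIGMA) * noise.T @ advantages   # ES gradient estimate
    print(f"epoch {epoch:2d}  alpha {alpha:.2f}  mean return {returns.mean():.3f}")
```

In the actual method, the inner-loop gradient is taken analytically through the learned loss network rather than by finite differences, and the outer loop runs many parallel workers across randomized environments; the sketch only mirrors the loop structure and the α annealing schedule quoted above.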