Evolution-Guided Policy Gradient in Reinforcement Learning
Authors: Shauharda Khadka, Kagan Tumer
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods. |
| Researcher Affiliation | Academia | Shauharda Khadka Kagan Tumer Collaborative Robotics and Intelligent Systems Institute Oregon State University {khadkas,kagan.tumer}@oregonstate.edu |
| Pseudocode | Yes | Algorithms 1, 2, and 3 provide detailed pseudocode of the ERL algorithm using DDPG as its policy gradient component. (A hedged sketch of the main loop appears after this table.) |
| Open Source Code | Yes | Code available at https://github.com/ShawK91/erl_paper_nips18 |
| Open Datasets | Yes | We evaluated the performance of ERL agents on 6 continuous control tasks simulated using Mujoco [56]. These are benchmarks used widely in the field [13, 25, 53, 47] and are hosted through the OpenAI gym [6]. |
| Dataset Splits | No | The paper does not specify explicit dataset split percentages (e.g., 80/10/10) or absolute sample counts for training, validation, and test sets. It mentions using well-known environments for evaluation but not how data within these environments is formally partitioned into these splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | ERL is implemented using PyTorch [39] while OpenAI Baselines [11] was used to implement PPO and DDPG. While software names are mentioned, specific version numbers for PyTorch or OpenAI Baselines are not provided. |
| Experiment Setup | Yes | Adam [29] optimizer with gradient clipping at 10 and learning rates of 5e-5 and 5e-4 was used for the rl_actor and rl_critic, respectively. The size of the population k was set to 10, while the elite fraction ψ varied from 0.1 to 0.3 across tasks. The number of trials conducted to compute a fitness score, ξ, ranged from 1 to 5 across tasks. The size of the replay buffer and the batch size were set to 1e6 and 128, respectively. The discount rate γ and target weight τ were set to 0.99 and 1e-3, respectively. The mutation probability mut_prob was set to 0.9, while the synchronization period ω ranged from 1 to 10 across tasks. The mutation strength mut_strength was set to 0.1, corresponding to 10% Gaussian noise. Finally, the mutation fraction mut_frac was set to 0.1, while the super-mutation probability supermut_prob and reset probability resetmut_prob were both set to 0.05. (These settings are collected into a configuration sketch after the table.) |
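
For readers skimming the checklist, below is a minimal, hedged sketch of the ERL main loop described by the paper's Algorithms 1-3: an evolutionary population whose rollouts fill a shared replay buffer, an off-policy DDPG learner trained from that buffer, and a periodic copy of the RL actor back into the population. The environment rollouts and the DDPG update are replaced with toy stand-ins so the structure runs end to end; names such as `evaluate`, `mutate`, and `ddpg_update` are illustrative and are not taken from the authors' released code.

```python
# Hedged sketch of the ERL main loop (Algorithms 1-3), with the DDPG learner and
# Mujoco rollouts replaced by toy stand-ins so the control flow is runnable.
import random
import numpy as np

POP_SIZE = 10        # population size k reported in the paper
ELITE_FRAC = 0.2     # elite fraction psi (0.1-0.3 across tasks)
MUT_PROB = 0.9       # mutation probability
MUT_STRENGTH = 0.1   # "10% Gaussian noise"
MUT_FRAC = 0.1       # fraction of parameters perturbed per mutation
SYNC_PERIOD = 5      # omega: how often the RL actor is copied into the population

def evaluate(policy, replay_buffer):
    """Stand-in fitness: a real run would roll out episodes in the task and
    append every transition to the shared replay buffer."""
    fitness = -float(np.sum((policy - 1.0) ** 2))   # toy objective
    replay_buffer.append((policy.copy(), fitness))
    return fitness

def mutate(policy):
    """Perturb a random fraction (mut_frac) of the parameters with Gaussian noise."""
    child = policy.copy()
    mask = np.random.rand(child.size) < MUT_FRAC
    child[mask] += np.random.randn(int(mask.sum())) * MUT_STRENGTH
    return child

def ddpg_update(rl_actor, replay_buffer):
    """Placeholder for the DDPG actor/critic update on samples from the buffer."""
    if replay_buffer:
        target, _ = random.choice(replay_buffer)
        rl_actor = rl_actor + 0.05 * (target - rl_actor)   # toy 'gradient' step
    return rl_actor

population = [np.random.randn(8) for _ in range(POP_SIZE)]
rl_actor = np.random.randn(8)
replay_buffer = []

for generation in range(50):
    # 1. Evaluate the evolutionary population; their rollouts feed the shared buffer.
    fitness = [evaluate(p, replay_buffer) for p in population]

    # 2. Rank, keep elites, refill the rest via selection + mutation.
    order = np.argsort(fitness)[::-1]
    n_elites = max(1, int(ELITE_FRAC * POP_SIZE))
    elites = [population[i] for i in order[:n_elites]]
    new_pop = [p.copy() for p in elites]
    while len(new_pop) < POP_SIZE:
        parent = random.choice(elites)
        new_pop.append(mutate(parent) if random.random() < MUT_PROB else parent.copy())
    population = new_pop

    # 3. The RL actor learns off-policy from the shared replay buffer.
    rl_actor = ddpg_update(rl_actor, replay_buffer)

    # 4. Periodic synchronization: copy the RL actor into the population so the EA
    #    can exploit gradient information (the paper replaces the weakest member;
    #    here the last slot stands in). It is evaluated in the next generation.
    if generation % SYNC_PERIOD == SYNC_PERIOD - 1:
        population[-1] = rl_actor.copy()
```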
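
The hyperparameters quoted in the Experiment Setup row, gathered into one place as a plain Python dictionary for quick reference. Key names are illustrative rather than drawn from the released code, and values the paper varies per task are recorded as (min, max) ranges.

```python
# Reported ERL hyperparameters; tuples mark values that vary across tasks.
ERL_HYPERPARAMS = {
    "optimizer": "Adam",
    "gradient_clip": 10,
    "lr_actor": 5e-5,
    "lr_critic": 5e-4,
    "population_size_k": 10,
    "elite_fraction_psi": (0.1, 0.3),   # varies across tasks
    "fitness_trials_xi": (1, 5),        # varies across tasks
    "replay_buffer_size": int(1e6),
    "batch_size": 128,
    "discount_gamma": 0.99,
    "target_weight_tau": 1e-3,
    "mutation_prob": 0.9,
    "sync_period_omega": (1, 10),       # varies across tasks
    "mutation_strength": 0.1,           # 10% Gaussian noise
    "mutation_fraction": 0.1,
    "super_mutation_prob": 0.05,
    "reset_mutation_prob": 0.05,
}
```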