Evolution-Guided Policy Gradient in Reinforcement Learning

Authors: Shauharda Khadka, Kagan Tumer

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods.
Researcher Affiliation | Academia | Shauharda Khadka, Kagan Tumer; Collaborative Robotics and Intelligent Systems Institute, Oregon State University; {khadkas,kagan.tumer}@oregonstate.edu
Pseudocode | Yes | Algorithms 1, 2, and 3 provide detailed pseudocode of the ERL algorithm using DDPG as its policy gradient component (a minimal sketch of the loop follows the table).
Open Source Code | Yes | Code available at https://github.com/ShawK91/erl_paper_nips18
Open Datasets | Yes | We evaluated the performance of ERL agents on 6 continuous control tasks simulated using MuJoCo [56]. These are benchmarks used widely in the field [13, 25, 53, 47] and are hosted through the OpenAI Gym [6]. (An environment-loading example follows the table.)
Dataset Splits | No | The paper does not specify explicit dataset split percentages (e.g., 80/10/10) or absolute sample counts for training, validation, and test sets. It uses well-known environments for evaluation but does not describe any formal partitioning of data within them.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | ERL is implemented using PyTorch [39], while OpenAI Baselines [11] was used to implement PPO and DDPG. Although the software names are given, no version numbers are provided for PyTorch or OpenAI Baselines.
Experiment Setup | Yes | The Adam [29] optimizer with gradient clipping at 10 and learning rates of 5e-5 and 5e-4 was used for the RL actor and RL critic, respectively. The population size k was set to 10, while the elite fraction ψ varied from 0.1 to 0.3 across tasks. The number of trials used to compute a fitness score, ξ, ranged from 1 to 5 across tasks. The replay buffer size and batch size were set to 1e6 and 128, respectively. The discount rate γ and target weight τ were set to 0.99 and 1e-3, respectively. The mutation probability mut_prob was set to 0.9, while the synchronization period ω ranged from 1 to 10 across tasks. The mutation strength mut_strength was set to 0.1, corresponding to 10% Gaussian noise. Finally, the mutation fraction mut_frac was set to 0.1, while the super-mutation probability super_mut_prob and reset probability reset_mut_prob were both set to 0.05. (These values are collected into a configuration sketch below.)
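The Pseudocode row above refers to Algorithms 1–3 of the paper. Below is a minimal, self-contained sketch of that loop, not the authors' implementation: the policies are plain parameter vectors scored on a toy fitness function, and the DDPG update (`rl_actor_update`) is a stubbed placeholder. It is meant only to illustrate the structure ERL describes: shared evaluation, elitism, tournament selection, mutation, and periodic synchronization of the RL actor into the evolutionary population.

```python
# Minimal sketch of the ERL loop (Algorithms 1-3 of the paper); toy fitness
# and a stubbed DDPG step are assumptions made so the sketch runs end to end.
import numpy as np

rng = np.random.default_rng(0)

DIM = 8             # toy policy parameter dimension (illustrative assumption)
K = 10              # population size k, as reported in the paper
ELITE_FRAC = 0.1    # elite fraction psi (paper: 0.1-0.3 across tasks)
MUT_PROB = 0.9      # mutation probability
MUT_STRENGTH = 0.1  # 10% Gaussian noise
OMEGA = 5           # synchronization period omega (paper: 1-10 across tasks)


def fitness(theta):
    # Stand-in for an episode return; ERL computes this via environment rollouts
    # whose transitions are also pushed into the shared replay buffer.
    return -float(np.sum((theta - 1.0) ** 2))


def mutate(theta):
    # Gaussian parameter perturbation, applied with probability MUT_PROB.
    child = theta.copy()
    if rng.random() < MUT_PROB:
        child += MUT_STRENGTH * rng.standard_normal(DIM)
    return child


def rl_actor_update(theta_rl, replay_buffer):
    # Placeholder for the DDPG actor/critic gradient step (assumption);
    # here it is just a small random step so the sketch runs without MuJoCo.
    return theta_rl + 0.01 * rng.standard_normal(DIM)


population = [rng.standard_normal(DIM) for _ in range(K)]
theta_rl = rng.standard_normal(DIM)
replay_buffer = []  # shared buffer filled by all population rollouts in ERL

for generation in range(1, 101):
    # 1. Evaluate every individual (averaged over xi trials in the paper).
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]

    # 2. Preserve the elites unchanged.
    n_elites = max(1, int(ELITE_FRAC * K))
    elites = [population[i].copy() for i in order[:n_elites]]

    # 3. Tournament-select parents and mutate them to refill the population.
    offspring = []
    while len(offspring) < K - n_elites:
        a, b = rng.choice(K, size=2, replace=False)
        parent = population[a] if scores[a] >= scores[b] else population[b]
        offspring.append(mutate(parent))
    population = elites + offspring

    # 4. The RL actor trains off-policy from the shared replay buffer.
    theta_rl = rl_actor_update(theta_rl, replay_buffer)

    # 5. Every OMEGA generations, copy the RL actor into the population
    #    (ERL replaces the weakest individual; here, the last offspring slot).
    if generation % OMEGA == 0:
        population[-1] = theta_rl.copy()

print("best fitness:", max(fitness(ind) for ind in population))
```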
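The Open Datasets row notes that the benchmarks are hosted through OpenAI Gym. The snippet below shows a typical way to load one of them; the environment ID (`HalfCheetah-v2`) and the 4-tuple `step()` return are assumptions tied to a 2018-era gym + mujoco-py setup, and newer Gymnasium releases use a different API.

```python
# Illustrative environment loading via OpenAI Gym (not taken from the paper's code).
# "HalfCheetah-v2" and the 4-tuple step() API are assumptions for a 2018-era
# gym + mujoco-py installation; MuJoCo binaries/licence were required at the time.
import gym

env = gym.make("HalfCheetah-v2")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()          # random actions as a stand-in policy
    obs, reward, done, info = env.step(action)  # old Gym API: (obs, reward, done, info)
    episode_return += reward
print("episode return:", episode_return)
```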
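For reference, the hyperparameters quoted in the Experiment Setup row are collected below as a plain Python dictionary. The key names are ours rather than the authors', and 2-tuples denote values that varied across the six tasks.

```python
# Hyperparameters reported in the paper; key names are illustrative, and
# 2-tuples denote ranges that varied across tasks.
ERL_HYPERPARAMS = {
    "optimizer": "Adam",
    "gradient_clip": 10,
    "lr_rl_actor": 5e-5,
    "lr_rl_critic": 5e-4,
    "population_size_k": 10,
    "elite_fraction_psi": (0.1, 0.3),
    "fitness_trials_xi": (1, 5),
    "replay_buffer_size": int(1e6),
    "batch_size": 128,
    "discount_gamma": 0.99,
    "target_weight_tau": 1e-3,
    "mutation_prob": 0.9,
    "sync_period_omega": (1, 10),
    "mutation_strength": 0.1,   # 10% Gaussian noise
    "mutation_fraction": 0.1,
    "super_mutation_prob": 0.05,
    "reset_mutation_prob": 0.05,
}
```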