Evolution-Guided Policy Gradient in Reinforcement Learning

Authors: Shauharda Khadka, Kagan Tumer

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods.
Researcher Affiliation | Academia | Shauharda Khadka, Kagan Tumer; Collaborative Robotics and Intelligent Systems Institute, Oregon State University; {khadkas,kagan.tumer}@oregonstate.edu
Pseudocode | Yes | Algorithms 1, 2, and 3 provide detailed pseudocode of the ERL algorithm using DDPG as its policy gradient component (a minimal sketch of the loop follows the table).
Open Source Code | Yes | Code available at https://github.com/ShawK91/erl_paper_nips18
Open Datasets | Yes | We evaluated the performance of ERL agents on 6 continuous control tasks simulated using MuJoCo [56]. These are benchmarks used widely in the field [13, 25, 53, 47] and are hosted through the OpenAI Gym [6]. (An environment-loading example follows the table.)
Dataset Splits | No | The paper does not specify explicit dataset split percentages (e.g., 80/10/10) or absolute sample counts for training, validation, and test sets. It uses well-known environments for evaluation but does not describe any formal partitioning of data within them.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | ERL is implemented using PyTorch [39], while OpenAI Baselines [11] was used to implement PPO and DDPG. Although the software names are given, no version numbers are provided for PyTorch or OpenAI Baselines.
Experiment Setup | Yes | The Adam [29] optimizer with gradient clipping at 10 and learning rates of 5e-5 and 5e-4 was used for the RL actor and RL critic, respectively. The population size k was set to 10, while the elite fraction ψ varied from 0.1 to 0.3 across tasks. The number of trials used to compute a fitness score, ξ, ranged from 1 to 5 across tasks. The replay buffer size and batch size were set to 1e6 and 128, respectively. The discount rate γ and target weight τ were set to 0.99 and 1e-3, respectively. The mutation probability mut_prob was set to 0.9, while the synchronization period ω ranged from 1 to 10 across tasks. The mutation strength mut_strength was set to 0.1, corresponding to 10% Gaussian noise. Finally, the mutation fraction mut_frac was set to 0.1, while the super-mutation probability super_mut_prob and reset probability reset_mut_prob were both set to 0.05. (These values are collected into a configuration sketch below.)
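The Pseudocode row above refers to Algorithms 1–3 of the paper. Below is a minimal, self-contained sketch of that loop, not the authors' implementation: the policies are plain parameter vectors scored on a toy fitness function, and the DDPG update (`rl_actor_update`) is a stubbed placeholder. It is meant only to illustrate the structure ERL describes: shared evaluation, elitism, tournament selection, mutation, and periodic synchronization of the RL actor into the evolutionary population.

```python
# Minimal sketch of the ERL loop (Algorithms 1-3 of the paper); toy fitness
# and a stubbed DDPG step are assumptions made so the sketch runs end to end.
import numpy as np

rng = np.random.default_rng(0)

DIM = 8             # toy policy parameter dimension (illustrative assumption)
K = 10              # population size k, as reported in the paper
ELITE_FRAC = 0.1    # elite fraction psi (paper: 0.1-0.3 across tasks)
MUT_PROB = 0.9      # mutation probability
MUT_STRENGTH = 0.1  # 10% Gaussian noise
OMEGA = 5           # synchronization period omega (paper: 1-10 across tasks)


def fitness(theta):
    # Stand-in for an episode return; ERL computes this via environment rollouts
    # whose transitions are also pushed into the shared replay buffer.
    return -float(np.sum((theta - 1.0) ** 2))


def mutate(theta):
    # Gaussian parameter perturbation, applied with probability MUT_PROB.
    child = theta.copy()
    if rng.random() < MUT_PROB:
        child += MUT_STRENGTH * rng.standard_normal(DIM)
    return child


def rl_actor_update(theta_rl, replay_buffer):
    # Placeholder for the DDPG actor/critic gradient step (assumption);
    # here it is just a small random step so the sketch runs without MuJoCo.
    return theta_rl + 0.01 * rng.standard_normal(DIM)


population = [rng.standard_normal(DIM) for _ in range(K)]
theta_rl = rng.standard_normal(DIM)
replay_buffer = []  # shared buffer filled by all population rollouts in ERL

for generation in range(1, 101):
    # 1. Evaluate every individual (averaged over xi trials in the paper).
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]

    # 2. Preserve the elites unchanged.
    n_elites = max(1, int(ELITE_FRAC * K))
    elites = [population[i].copy() for i in order[:n_elites]]

    # 3. Tournament-select parents and mutate them to refill the population.
    offspring = []
    while len(offspring) < K - n_elites:
        a, b = rng.choice(K, size=2, replace=False)
        parent = population[a] if scores[a] >= scores[b] else population[b]
        offspring.append(mutate(parent))
    population = elites + offspring

    # 4. The RL actor trains off-policy from the shared replay buffer.
    theta_rl = rl_actor_update(theta_rl, replay_buffer)

    # 5. Every OMEGA generations, copy the RL actor into the population
    #    (ERL replaces the weakest individual; here, the last offspring slot).
    if generation % OMEGA == 0:
        population[-1] = theta_rl.copy()

print("best fitness:", max(fitness(ind) for ind in population))
```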
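The Open Datasets row notes that the benchmarks are hosted through OpenAI Gym. The snippet below shows a typical way to load one of them; the environment ID (`HalfCheetah-v2`) and the 4-tuple `step()` return are assumptions tied to a 2018-era gym + mujoco-py setup, and newer Gymnasium releases use a different API.

```python
# Illustrative environment loading via OpenAI Gym (not taken from the paper's code).
# "HalfCheetah-v2" and the 4-tuple step() API are assumptions for a 2018-era
# gym + mujoco-py installation; MuJoCo binaries/licence were required at the time.
import gym

env = gym.make("HalfCheetah-v2")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()          # random actions as a stand-in policy
    obs, reward, done, info = env.step(action)  # old Gym API: (obs, reward, done, info)
    episode_return += reward
print("episode return:", episode_return)
```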
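For reference, the hyperparameters quoted in the Experiment Setup row are collected below as a plain Python dictionary. The key names are ours rather than the authors', and 2-tuples denote values that varied across the six tasks.

```python
# Hyperparameters reported in the paper; key names are illustrative, and
# 2-tuples denote ranges that varied across tasks.
ERL_HYPERPARAMS = {
    "optimizer": "Adam",
    "gradient_clip": 10,
    "lr_rl_actor": 5e-5,
    "lr_rl_critic": 5e-4,
    "population_size_k": 10,
    "elite_fraction_psi": (0.1, 0.3),
    "fitness_trials_xi": (1, 5),
    "replay_buffer_size": int(1e6),
    "batch_size": 128,
    "discount_gamma": 0.99,
    "target_weight_tau": 1e-3,
    "mutation_prob": 0.9,
    "sync_period_omega": (1, 10),
    "mutation_strength": 0.1,   # 10% Gaussian noise
    "mutation_fraction": 0.1,
    "super_mutation_prob": 0.05,
    "reset_mutation_prob": 0.05,
}
```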