Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning
Authors: Sumeet Batra, Bryon Tjanaka, Matthew Christopher Fontaine, Aleksei Petrenko, Stefanos Nikolaidis, Gaurav S. Sukhatme
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our algorithm on four different continuous-control locomotion tasks derived from the original Mujoco environments (Todorov et al., 2012): Ant, Walker2d, Half-Cheetah, and Humanoid. [...] Figures 3 and 4 show that PPGA outperforms baselines in best reward and QD-score, achieving comparable coverage scores on all tasks except Ant, and generating much more illuminated archive heatmaps with a diverse range of higher performing policies than the current state of the art, PGA-ME. |
| Researcher Affiliation | Academia | Sumeet Batra University of Southern California Los Angeles, CA 90089 ssbatra@usc.edu Bryon Tjanaka University of Southern California Los Angeles, CA 90089 tjanaka@usc.edu Matthew C. Fontaine University of Southern California Los Angeles, CA 90089 mfontain@usc.edu Aleksei Petrenko University of Southern California Los Angeles, CA 90089 petrenko@usc.edu Stefanos Nikolaidis University of Southern California Los Angeles, CA 90089 nikolaid@usc.edu Gaurav S. Sukhatme University of Southern California Los Angeles, CA 90089 gaurav@usc.edu |
| Pseudocode | Yes | We provide pseudocode in Appendix A: Algorithm 1 (Proximal Policy Gradient Arborescence), Algorithm 2 (Update Archive), and Algorithm 3 (Vectorized-PPO, VPPO). |
| Open Source Code | Yes | In the supplemental material, we provide the source code and training scripts used to produce our results. |
| Open Datasets | Yes | We evaluate our algorithm on four different continuous-control locomotion tasks derived from the original Mujoco environments (Todorov et al., 2012): Ant, Walker2d, Half-Cheetah, and Humanoid. |
| Dataset Splits | No | The paper refers to using 'environments' for evaluation and mentions 'rollout length' and 'episode length', but it does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | Yes | Most experiments were run on a SLURM cluster where each job had access to an NVIDIA RTX 2080Ti GPU, 4 cores from an Intel(R) Xeon(R) Gold 6154 3.00GHz CPU, and 108GB of RAM. Some additional experiments and ablations were run on local workstations with access to an NVIDIA RTX 3090, an AMD Ryzen 7900x 12-core CPU, and 64GB of RAM. |
| Software Dependencies | No | The paper mentions implementing PPGA in pyribs and basing VPPO on CleanRL's PPO implementation, and states that documentation for setting up a Conda environment is included in the README. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | A full list of shared hyperparameters is in Appendix B. We use an archive learning rate of 0.1, 0.15, 0.1, and 1.0 on Humanoid, Walker2d, Ant, and Half-Cheetah, respectively. Adaptive standard deviation is enabled for Ant and Humanoid. We reset the action distribution standard deviation to 1.0 on each iteration in all other environments. Table 1 in Appendix B provides further hyperparameters, including: ACTOR NETWORK: [128, 128, ACTION DIM]; CRITIC NETWORK: [256, 256, 1]; N1: 10; N2: 10; PPO NUM MINIBATCHES: 8; PPO NUM EPOCHS: 4; OBSERVATION NORMALIZATION: TRUE; REWARD NORMALIZATION: TRUE; ROLLOUT LENGTH: 128 (these quoted values are gathered into a configuration sketch below the table). |
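
The Experiment Setup row quotes network shapes and shared PPO hyperparameters from Appendix B. The following is a minimal sketch of how those quoted values could be organized, assuming PyTorch and a CleanRL-style tanh MLP (the activation is an assumption; the excerpt does not state it). Every identifier here (`make_mlp`, `SHARED_HPARAMS`, `ARCHIVE_LR`) and the placeholder observation/action dimensions are illustrative, not taken from the authors' released code.

```python
# Hedged sketch: organizes the hyperparameters quoted from Appendix B.
# Names and placeholder dimensions are hypothetical, not the authors' code.
import torch.nn as nn


def make_mlp(in_dim: int, hidden: list[int], out_dim: int) -> nn.Sequential:
    """Build a fully connected network with tanh activations (assumed)."""
    layers: list[nn.Module] = []
    prev = in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.Tanh()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)


# Placeholder sizes; actual values depend on the chosen Mujoco task.
obs_dim, action_dim = 8, 2

# ACTOR NETWORK [128, 128, ACTION DIM] and CRITIC NETWORK [256, 256, 1]
actor = make_mlp(obs_dim, [128, 128], action_dim)
critic = make_mlp(obs_dim, [256, 256], 1)

# Shared hyperparameters quoted from Table 1 (Appendix B). N1 and N2 are
# PPO iteration counts used by Algorithm 1; their exact roles are described
# in the paper's pseudocode.
SHARED_HPARAMS = {
    "N1": 10,
    "N2": 10,
    "ppo_num_minibatches": 8,
    "ppo_num_epochs": 4,
    "observation_normalization": True,
    "reward_normalization": True,
    "rollout_length": 128,
}

# Per-environment archive learning rates quoted in the Experiment Setup row.
ARCHIVE_LR = {"humanoid": 0.1, "walker2d": 0.15, "ant": 0.1, "halfcheetah": 1.0}
```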