Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning
Authors: Sumeet Batra, Bryon Tjanaka, Matthew Christopher Fontaine, Aleksei Petrenko, Stefanos Nikolaidis, Gaurav S. Sukhatme
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our algorithm on four different continuous-control locomotion tasks derived from the original Mujoco environments (Todorov et al., 2012): Ant, Walker2d, Half-Cheetah, and Humanoid. [...] Figures 3 and 4 show that PPGA outperforms baselines in best reward and QD-score, achieving comparable coverage scores on all tasks except Ant, and generating much more illuminated archive heatmaps with a diverse range of higher performing policies than the current state of the art, PGA-ME. |
| Researcher Affiliation | Academia | Sumeet Batra University of Southern California Los Angeles, CA 90089 ssbatra@usc.edu Bryon Tjanaka University of Southern California Los Angeles, CA 90089 tjanaka@usc.edu Matthew C. Fontaine University of Southern California Los Angeles, CA 90089 mfontain@usc.edu Aleksei Petrenko University of Southern California Los Angeles, CA 90089 petrenko@usc.edu Stefanos Nikolaidis University of Southern California Los Angeles, CA 90089 nikolaid@usc.edu Gaurav S. Sukhatme University of Southern California Los Angeles, CA 90089 gaurav@usc.edu |
| Pseudocode | Yes | We provide pseudocode in Appendix A: Algorithm 1 (Proximal Policy Gradient Arborescence), Algorithm 2 (Update Archive), and Algorithm 3 (Vectorized-PPO, VPPO). |
| Open Source Code | Yes | In the supplemental material, we provide the source code and training scripts used to produce our results. |
| Open Datasets | Yes | We evaluate our algorithm on four different continuous-control locomotion tasks derived from the original Mujoco environments (Todorov et al., 2012): Ant, Walker2d, Half-Cheetah, and Humanoid. |
| Dataset Splits | No | The paper refers to using 'environments' for evaluation and mentions 'rollout length' and 'episode length', but it does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | Yes | Most experiments were run on a SLURM cluster where each job had access to an NVIDIA RTX 2080Ti GPU, 4 cores from an Intel(R) Xeon(R) Gold 6154 3.00GHz CPU, and 108GB of RAM. Some additional experiments and ablations were run on local workstations with access to an NVIDIA RTX 3090, an AMD Ryzen 7900x 12-core CPU, and 64GB of RAM. |
| Software Dependencies | No | The paper mentions implementing PPGA in pyribs and basing VPPO on CleanRL's PPO implementation, and states that documentation for setting up a Conda environment is included in the README. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | A full list of shared hyperparameters is in Appendix B. We use an archive learning rate of 0.1, 0.15, 0.1, and 1.0 on Humanoid, Walker2d, Ant, and Half-Cheetah, respectively. Adaptive standard deviation is enabled for Ant and Humanoid. We reset the action distribution standard deviation to 1.0 on each iteration in all other environments. Table 1 in Appendix B provides further hyperparameters, including: ACTOR NETWORK: [128, 128, ACTION DIM]; CRITIC NETWORK: [256, 256, 1]; N1: 10; N2: 10; PPO NUM MINIBATCHES: 8; PPO NUM EPOCHS: 4; OBSERVATION NORMALIZATION: TRUE; REWARD NORMALIZATION: TRUE; ROLLOUT LENGTH: 128 (these quoted values are gathered into a configuration sketch below the table). |
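
The Experiment Setup row quotes network shapes and shared PPO hyperparameters from Appendix B. The following is a minimal sketch of how those quoted values could be organized, assuming PyTorch and a CleanRL-style tanh MLP (the activation is an assumption; the excerpt does not state it). Every identifier here (`make_mlp`, `SHARED_HPARAMS`, `ARCHIVE_LR`) and the placeholder observation/action dimensions are illustrative, not taken from the authors' released code.

```python
# Hedged sketch: organizes the hyperparameters quoted from Appendix B.
# Names and placeholder dimensions are hypothetical, not the authors' code.
import torch.nn as nn


def make_mlp(in_dim: int, hidden: list[int], out_dim: int) -> nn.Sequential:
    """Build a fully connected network with tanh activations (assumed)."""
    layers: list[nn.Module] = []
    prev = in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.Tanh()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)


# Placeholder sizes; actual values depend on the chosen Mujoco task.
obs_dim, action_dim = 8, 2

# ACTOR NETWORK [128, 128, ACTION DIM] and CRITIC NETWORK [256, 256, 1]
actor = make_mlp(obs_dim, [128, 128], action_dim)
critic = make_mlp(obs_dim, [256, 256], 1)

# Shared hyperparameters quoted from Table 1 (Appendix B). N1 and N2 are
# PPO iteration counts used by Algorithm 1; their exact roles are described
# in the paper's pseudocode.
SHARED_HPARAMS = {
    "N1": 10,
    "N2": 10,
    "ppo_num_minibatches": 8,
    "ppo_num_epochs": 4,
    "observation_normalization": True,
    "reward_normalization": True,
    "rollout_length": 128,
}

# Per-environment archive learning rates quoted in the Experiment Setup row.
ARCHIVE_LR = {"humanoid": 0.1, "walker2d": 0.15, "ant": 0.1, "halfcheetah": 1.0}
```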