Effective Diversity in Population Based Reinforcement Learning
Authors: Jack Parker-Holder, Aldo Pacchiano, Krzysztof M. Choromanski, Stephen J. Roberts
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we evaluate DvD-ES and DvD-TD3 in a variety of challenging settings. We focus first on the ES setting, since ES is cheap to run on CPUs [49], which allows us to run a series of ablation studies. In Sec. 6 we provide empirical evidence of the effectiveness of DvD. |
| Researcher Affiliation | Collaboration | Jack Parker-Holder, University of Oxford (jackph@robots.ox.ac.uk); Aldo Pacchiano, UC Berkeley (pacchiano@berkeley.edu); Krzysztof Choromanski, Google Brain Robotics (kchoro@google.com); Stephen Roberts, University of Oxford (sjrob@robots.ox.ac.uk) |
| Pseudocode | No | The paper describes the algorithms (DvD-ES and DvD-TD3) in text and mathematical equations in Section 4, but it does not include a distinct 'Pseudocode' or 'Algorithm' block/figure. |
| Open Source Code | Yes | To run these experiments see the following repo: https://github.com/jparkerholder/DvD_ES. |
| Open Datasets | Yes | We begin with a simple environment, whereby a two-dimensional point agent is given a reward equal to the negative distance away from a goal. The agent is separated from its goal by a wall (see Fig. 3a). Our environments are based on the Cheetah and Ant [...] widely studied continuous control tasks from OpenAI Gym, [and the] Humanoid environment from the OpenAI Gym [5]. [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016. (A minimal sketch of the point environment follows the table.) |
| Dataset Splits | No | The paper discusses training and testing procedures (e.g., 'We train each for a total of one million timesteps'), but it does not explicitly mention the use or size of a validation dataset split. |
| Hardware Specification | Yes | All experiments made use of the ray [36] library for parallel computing, with experiments run on a 32-core machine. (A sketch of this parallel-rollout pattern follows the table.) |
| Software Dependencies | No | The paper mentions using the 'ray [36] library', but does not provide a specific version number for it or any other software dependencies. |
| Experiment Setup | Yes | We parameterize our policies with two hidden layer neural networks, with tanh activations (more details are in the Appendix, Section 8.2). We train DvD-TD3 with M = 5 agents, where each agent has its own neural network policy, but a shared Q-function. We benchmark against both a single agent (M = 1), which is vanilla TD3, and then what we call ensemble TD3 (E-TD3), where M = 5 but there is no diversity term. We initialize all methods with 25,000 random timesteps, where we set λ_t = 0 for DvD-TD3. We train each for a total of one million timesteps, and repeat for 7 seeds. (A sketch of this population setup follows the table.) |
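
The point environment quoted in the Open Datasets row is simple enough to sketch. Below is a minimal, self-contained approximation; the goal and wall coordinates, step size, and episode horizon are illustrative assumptions, since the excerpt does not give them. Only the reward (negative distance to the goal) and the wall separating the agent from the goal come from the paper.

```python
import numpy as np

class PointWallEnv:
    """2-D point agent rewarded with the negative distance to a goal,
    separated from the goal by a wall (cf. Fig. 3a in the paper).
    Coordinates, step size, and horizon are illustrative assumptions."""

    def __init__(self, goal=(5.0, 0.0), wall_x=2.5, wall_len=2.0, horizon=100):
        self.goal = np.asarray(goal, dtype=np.float64)
        self.wall_x = wall_x      # x-position of the vertical wall (assumed)
        self.wall_len = wall_len  # wall spans y in [-wall_len, wall_len] (assumed)
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        action = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
        proposed = self.pos + 0.1 * action
        # Disallow steps that cross the wall segment; the agent must go around.
        crosses = (self.pos[0] - self.wall_x) * (proposed[0] - self.wall_x) < 0
        if crosses and abs(proposed[1]) <= self.wall_len:
            proposed = self.pos  # blocked by the wall
        self.pos = proposed
        self.t += 1
        reward = -float(np.linalg.norm(self.pos - self.goal))  # negative distance
        done = self.t >= self.horizon
        return self.pos.copy(), reward, done, {}
```

A greedy agent that walks straight toward the goal gets stuck at the wall, which is why the task is used to probe exploration and population diversity.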
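The Hardware Specification row notes that experiments parallelized work with the ray library on a 32-core machine. Here is a minimal sketch of that pattern using ray's remote-function API; the toy fitness function is a stand-in for the paper's ES rollout code, which the excerpt does not include.

```python
import numpy as np
import ray

ray.init(num_cpus=4)  # the paper used a 32-core machine; 4 here for illustration

@ray.remote
def rollout(theta, seed):
    # Stand-in for an ES rollout: in the paper, each worker would run an
    # episode in a Gym environment and return its score. Here we score a
    # flat parameter vector on a noisy toy objective instead.
    rng = np.random.default_rng(seed)
    return -float(np.sum(theta ** 2)) + rng.normal(scale=0.01)

# Evaluate a population of M = 5 parameter vectors in parallel.
population = [np.random.randn(10) for _ in range(5)]
fitnesses = ray.get([rollout.remote(theta, s) for s, theta in enumerate(population)])
print(fitnesses)
ray.shutdown()
```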
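The Experiment Setup row describes two-hidden-layer tanh policies and a population of M = 5 DvD-TD3 agents sharing a single Q-function. Below is a PyTorch sketch of that structure; the hidden width, the critic architecture, and the HalfCheetah-like observation/action dimensions are assumptions, as the paper defers such details to its Appendix, Section 8.2.

```python
import torch
import torch.nn as nn

HIDDEN = 256  # hidden width is an assumption; the paper defers sizes to its appendix

class Policy(nn.Module):
    """Two-hidden-layer policy with tanh activations, per the quoted setup."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, act_dim), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class QFunction(nn.Module):
    """Critic shared by every member of the population (architecture assumed)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

obs_dim, act_dim, M = 17, 6, 5  # HalfCheetah-like dims (assumed); M = 5 agents
policies = [Policy(obs_dim, act_dim) for _ in range(M)]  # one policy per agent
shared_q = QFunction(obs_dim, act_dim)                   # one critic shared by all
```

Note that full TD3 also uses twin critics and target networks; this sketch shows only the population structure the row describes, with one shared critic across the M policies.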