Effective Diversity in Population Based Reinforcement Learning

Authors: Jack Parker-Holder, Aldo Pacchiano, Krzysztof M. Choromanski, Stephen J. Roberts

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we evaluate DvD-ES and DvD-TD3 in a variety of challenging settings. We focus first on the ES setting, since ES is cheap to run on CPUs [49], which allows us to run a series of ablation studies. In Sec. 6 we provide empirical evidence of the effectiveness of DvD.
Researcher Affiliation | Collaboration | Jack Parker-Holder (University of Oxford, jackph@robots.ox.ac.uk); Aldo Pacchiano (UC Berkeley, pacchiano@berkeley.edu); Krzysztof Choromanski (Google Brain Robotics, kchoro@google.com); Stephen Roberts (University of Oxford, sjrob@robots.ox.ac.uk)
Pseudocode | No | The paper describes the algorithms (DvD-ES and DvD-TD3) in text and mathematical equations in Section 4, but it does not include a distinct 'Pseudocode' or 'Algorithm' block or figure. (An illustrative sketch of the determinant-based diversity score behind DvD appears below.)
Open Source Code | Yes | To run these experiments see the following repo: https://github.com/jparkerholder/DvD_ES.
Open Datasets | Yes | We begin with a simple environment, whereby a two-dimensional point agent is given a reward equal to the negative distance away from a goal. The agent is separated from its goal by a wall (see Fig. 3a). Our environments are based on the Cheetah and Ant, four widely studied continuous control tasks from OpenAI Gym, as well as the Humanoid environment from OpenAI Gym [5]. (A hypothetical reconstruction of the point environment appears below.) [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
Dataset Splits | No | The paper discusses training and testing procedures (e.g., 'We train each for a total of one million timesteps'), but it does not explicitly mention the use or size of a validation split.
Hardware Specification | Yes | All experiments made use of the ray [36] library for parallel computing, with experiments run on a 32-core machine. (A sketch of this parallel rollout pattern appears below.)
Software Dependencies | No | The paper mentions using the ray [36] library, but does not provide a specific version number for it or for any other software dependencies.
Experiment Setup | Yes | We parameterize our policies with two-hidden-layer neural networks with tanh activations (more details are in the Appendix, Section 8.2). We train DvD-TD3 with M = 5 agents, where each agent has its own neural network policy but a shared Q-function. We benchmark against both a single agent (M = 1), which is vanilla TD3, and what we call ensemble TD3 (E-TD3), where M = 5 but there is no diversity term. We initialize all methods with 25,000 random timesteps, during which we set λ_t = 0 for DvD-TD3. We train each for a total of one million timesteps, and repeat for 7 seeds. (A minimal sketch of this population parameterization appears below.)
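
Since the paper provides no algorithm block, the following is an illustrative sketch, not the authors' implementation, of a determinant-based population diversity score of the kind DvD (Diversity via Determinants) is built around. It assumes a squared-exponential kernel over behavioral embeddings formed by concatenating each policy's actions on a shared batch of states; the function names, kernel lengthscale, and embedding choice are our assumptions.

```python
import numpy as np

def behavioral_embedding(policy, states):
    """Concatenate a policy's actions on a shared batch of states into one vector
    (a common choice of behavioral embedding; assumed here, not quoted from the paper)."""
    return np.concatenate([np.atleast_1d(policy(s)) for s in states])

def dvd_diversity(policies, states, lengthscale=1.0):
    """Determinant of the kernel (similarity) matrix of the population's embeddings:
    near 1 when behaviors are mutually dissimilar, near 0 when any two agents
    behave almost identically."""
    emb = [behavioral_embedding(pi, states) for pi in policies]
    M = len(emb)
    K = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            sq_dist = np.sum((emb[i] - emb[j]) ** 2)
            K[i, j] = np.exp(-sq_dist / (2.0 * lengthscale ** 2))  # squared-exponential kernel
    return np.linalg.det(K)

# Tiny usage example with three random linear "policies" acting on 2D states.
rng = np.random.default_rng(0)
states = rng.normal(size=(10, 2))
policies = [lambda s, W=rng.normal(size=(2, 2)): W @ s for _ in range(3)]
print(dvd_diversity(policies, states))
```

In the paper, a diversity score of this kind enters the training objective with an adaptive weight λ_t; the setup row above notes that λ_t is set to 0 during the initial random timesteps of DvD-TD3.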
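
The point-agent task quoted in the 'Open Datasets' row is described only in prose, so the class below is a hypothetical reconstruction rather than the authors' environment: a 2D point agent rewarded with the negative distance to a goal, separated from the goal by a wall. The wall position, step size, and horizon are all assumptions.

```python
import numpy as np

class PointGoalWallEnv:
    """2D point agent; reward is the negative Euclidean distance to the goal,
    with a vertical wall between the start position and the goal (hypothetical)."""

    def __init__(self, goal=(5.0, 0.0), wall_x=2.5, wall_halfwidth=2.0, horizon=100):
        self.goal = np.asarray(goal, dtype=np.float64)
        self.wall_x = wall_x                  # wall sits on the line x = wall_x ...
        self.wall_halfwidth = wall_halfwidth  # ... spanning |y| <= wall_halfwidth
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        action = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
        new_pos = self.pos + 0.1 * action
        # Crudely block any move whose x-coordinate crosses the wall within its span.
        crosses = (self.pos[0] - self.wall_x) * (new_pos[0] - self.wall_x) < 0
        if not (crosses and abs(new_pos[1]) <= self.wall_halfwidth):
            self.pos = new_pos
        self.t += 1
        reward = -float(np.linalg.norm(self.pos - self.goal))  # negative distance to goal
        done = self.t >= self.horizon
        return self.pos.copy(), reward, done, {}
```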
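
The hardware row states that experiments used the ray library for parallel computing on a 32-core machine. As a generic illustration of that pattern (not the released code), the sketch below farms antithetic ES-style evaluations out as ray remote tasks; the stand-in fitness function, perturbation scale, and task count are our placeholders.

```python
import numpy as np
import ray

ray.init(num_cpus=32)  # the paper reports a 32-core machine

def fitness(params):
    """Stand-in objective; in the real setup this would be the episodic return
    of a policy parameterized by `params`."""
    return -float(np.sum(params ** 2))

@ray.remote
def evaluate_perturbation(theta, sigma, seed):
    """Evaluate one antithetic pair of Gaussian perturbations, ES-style."""
    eps = np.random.default_rng(seed).normal(size=theta.shape)
    return fitness(theta + sigma * eps), fitness(theta - sigma * eps), seed

theta = np.zeros(10)
futures = [evaluate_perturbation.remote(theta, 0.1, s) for s in range(64)]
results = ray.get(futures)  # evaluated in parallel across the available CPU cores
```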
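
To make the quoted experiment setup concrete, here is a minimal PyTorch sketch (ours, not the released code) of two-hidden-layer tanh policies, one per agent in a population of M = 5, sharing a single Q-function as in DvD-TD3. Hidden widths and the observation/action dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TanhPolicy(nn.Module):
    """Two-hidden-layer policy with tanh activations, as described in the setup row."""
    def __init__(self, obs_dim, act_dim, hidden=64):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squash actions into [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class SharedQFunction(nn.Module):
    """State-action value network shared by every agent in the population."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

obs_dim, act_dim, M = 17, 6, 5                               # placeholder dimensions
policies = [TanhPolicy(obs_dim, act_dim) for _ in range(M)]  # one policy per agent
shared_q = SharedQFunction(obs_dim, act_dim)                 # single Q-function for all M agents
```

Per the setup row, E-TD3 would use the same population without the diversity term, and M = 1 recovers vanilla TD3.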