Fast Population-Based Reinforcement Learning on a Single Machine
Authors: Arthur Flajolet, Claire Bizon Monroc, Karim Beguir, Thomas Pierrot
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we compare implementations and revisit previous studies to show that the judicious use of compilation and vectorization allows population-based training to be performed on a single machine with one accelerator with minimal overhead compared to training a single agent. We also show that, when provided with a few accelerators, our protocols extend to large population sizes for applications such as hyperparameter tuning. Numerical experiments: In Figure 2, we benchmark the time it takes to carry out one update step as a function of the population size N, given N data batches that have been previously loaded into the memory of the hardware accelerator (or into RAM if we are using a CPU). (A hedged sketch of such a vectorized population update appears after the table.) |
| Researcher Affiliation | Industry | Arthur Flajolet¹, Claire Bizon Monroc¹, Karim Beguir¹, Thomas Pierrot¹. ¹InstaDeep Ltd. Correspondence to: Arthur Flajolet <a.flajolet@instadeep.com>, Thomas Pierrot <t.pierrot@instadeep.com>. |
| Pseudocode | Yes | C. Example of manual vectorization with PyTorch: the code snippet is for a Multi-Layer Perceptron (MLP) model (which is the core model used by default in the state-of-the-art implementations of SAC and TD3). It has the structure: `from typing import Tuple`; `import math`; `import torch`; `class VectorizedLinearLayer(...)`; `class VectorizedMLP(...)`. (A hedged sketch of this vectorized-layer pattern is given after the table.) |
| Open Source Code | Yes | We publicly release our code¹ in the hope that it will encourage practitioners to use population-based learning more frequently for their research and applications. ¹ https://github.com/instadeepai/Fastpbrl |
| Open Datasets | Yes | Specifically, we use fully-connected neural networks (resp. convolutional neural networks followed by fully-connected layers) to parametrize the critics and policies of SAC and TD3 (resp. the critic of DQN). For the empirical study of this section on SAC and TD3, we generate training data corresponding to training agents on the MuJoCo Gym HalfCheetah-v2 environment, but similar results can be derived for robotic and locomotion simulation environments with higher-dimensional observation and action spaces such as Humanoid-v2. For DQN, we use the same pipeline used in (Mnih et al., 2013) to preprocess and stack images from the Atari 2600 games. |
| Dataset Splits | No | The paper mentions generating 'training data' through interaction with environments (MuJoCo Gym, Atari 2600) and refers to 'test' performance in the context of evaluation, but it does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing as commonly understood in supervised learning. The nature of data collection in RL typically involves continuous interaction rather than pre-defined static splits. |
| Hardware Specification | Yes | CPU refers to a single core (single thread) of an Intel Xeon 2.80 GHz processor. For this set of experiments, we use 4 T4 accelerators and 40 CPU cores of an Intel Xeon 2.80 GHz processor. The K80, T4, V100, and A100 accelerators are listed in the Figure 3 legend and Table 1. |
| Software Dependencies | No | We use the state-of-the-art implementations of SAC, TD3, and DQN from the ACME library (Hoffman et al., 2020) for the approaches relying on the JAX backend and the ones from Stable-Baselines3 (Raffin et al., 2021) for the approaches relying on the PyTorch backend. In all cases we use the latest available version of each library, the corresponding version of CUDA, and the latest available driver version for each hardware accelerator. |
| Experiment Setup | Yes | For all of these implementations, we also consider another variant where we carry out 50 (resp. 10) update steps in a row for TD3 and SAC (resp. DQN) without copying the values of the neural network parameters to the host memory between update steps. The entire population evolves at intervals of 100,000 update steps: the bottom 30% of agents (ranked by their last 10 episode returns) are replaced by copies of randomly sampled agents among the top 30% and have their hyperparameters re-sampled from a prior distribution. For TD3, we optimize: (1) the learning rates for the policy parameters and the critic parameters, with log-uniform distributions with lower bounds 3e-5 and upper bounds 3e-3...; (2) the frequency with which the policy parameters are updated w.r.t. the critic parameters (sampling a uniform value between 0.2 and 1); (3) the noise parameters (sampling uniform values between 0 and 1); and (4) the discount factor (sampling a uniform value between 0.9 and 1). (A hedged sketch of this exploit-and-explore step follows the table.) |
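
As a companion to the Research Type row, here is a minimal, hedged sketch of how one update step can be vectorized over a population with JAX's `vmap` and compiled once with `jit`. This is not the paper's ACME-based code: the toy quadratic loss, plain SGD update, layer sizes (17/6 roughly matching HalfCheetah-v2 observations/actions), and the names `update_step` and `population_update` are illustrative assumptions standing in for the actual SAC/TD3/DQN losses and optimizers.

```python
import jax
import jax.numpy as jnp


def init_params(key, in_dim=17, hidden=256, out_dim=6):
    """Initialize a tiny two-layer MLP (stand-in for a policy network)."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) * 0.01,
        "w2": jax.random.normal(k2, (hidden, out_dim)) * 0.01,
    }


def loss_fn(params, batch):
    """Toy regression loss standing in for an RL critic/policy loss."""
    obs, target = batch
    hidden = jnp.tanh(obs @ params["w1"])
    pred = hidden @ params["w2"]
    return jnp.mean((pred - target) ** 2)


def update_step(params, batch, lr):
    """One SGD step for a single agent (illustrative, not the paper's optimizer)."""
    grads = jax.grad(loss_fn)(params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)


# Vectorize the per-agent update over the population axis and compile it once.
population_update = jax.jit(jax.vmap(update_step, in_axes=(0, 0, 0)))

if __name__ == "__main__":
    population_size, batch_size = 8, 32
    keys = jax.random.split(jax.random.PRNGKey(0), population_size)
    population_params = jax.vmap(init_params)(keys)
    batches = (
        jax.random.normal(jax.random.PRNGKey(1), (population_size, batch_size, 17)),
        jax.random.normal(jax.random.PRNGKey(2), (population_size, batch_size, 6)),
    )
    learning_rates = jnp.full((population_size,), 3e-4)
    # One compiled call updates every member of the population on the accelerator.
    population_params = population_update(population_params, batches, learning_rates)
```

The design point this sketch illustrates is the one the quoted passage relies on: once the per-agent update is a pure function, `vmap` turns the population into an extra batch dimension, so a single compiled kernel updates all N agents at close to the cost of a larger batch.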
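
The Appendix C snippet referenced in the Pseudocode row is only excerpted above, so here is a minimal sketch of the same manual-vectorization pattern in PyTorch. It assumes rather than reproduces the paper's class design: the class names `VectorizedLinear` and `VectorizedMLP`, the initialization scheme, and the layer sizes are illustrative.

```python
import math
import torch


class VectorizedLinear(torch.nn.Module):
    """Applies `population_size` independent linear layers in one batched matmul."""

    def __init__(self, population_size: int, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.empty(population_size, in_features, out_features)
        )
        self.bias = torch.nn.Parameter(torch.empty(population_size, 1, out_features))
        # Fan-in uniform initialization, applied independently per population member.
        bound = 1.0 / math.sqrt(in_features)
        torch.nn.init.uniform_(self.weight, -bound, bound)
        torch.nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (population_size, batch_size, in_features)
        # returns: (population_size, batch_size, out_features)
        return torch.baddbmm(self.bias, x, self.weight)


class VectorizedMLP(torch.nn.Module):
    """Two-layer MLP where every population member is evaluated in one forward pass."""

    def __init__(self, population_size: int, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.layer1 = VectorizedLinear(population_size, in_dim, hidden)
        self.layer2 = VectorizedLinear(population_size, hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer2(torch.relu(self.layer1(x)))


if __name__ == "__main__":
    mlp = VectorizedMLP(population_size=8, in_dim=17, hidden=256, out_dim=6)
    obs = torch.randn(8, 32, 17)  # one batch of 32 observations per agent
    actions = mlp(obs)            # shape (8, 32, 6)
```

Carrying the population as a leading tensor dimension lets `torch.baddbmm` evaluate all members' layers in one kernel launch instead of looping over the population in Python.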
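
Finally, a minimal sketch of the exploit-and-explore step quoted in the Experiment Setup row. The thresholds (bottom/top 30%, last 10 episode returns, evolution every 100,000 update steps) and the TD3 hyperparameter priors follow the quotation; the `Agent` dataclass, the function names, and the choice of which two noise parameters to resample are hypothetical and not taken from the released implementation.

```python
import copy
import math
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    params: dict                  # neural-network weights (opaque here)
    hyperparams: dict             # per-agent hyperparameters
    recent_returns: list = field(default_factory=list)  # last episode returns


def log_uniform(low: float, high: float) -> float:
    """Sample from a log-uniform distribution on [low, high]."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))


def sample_hyperparams() -> dict:
    """Re-sample hyperparameters from the TD3 priors quoted above."""
    return {
        "policy_lr": log_uniform(3e-5, 3e-3),
        "critic_lr": log_uniform(3e-5, 3e-3),
        "policy_update_freq": random.uniform(0.2, 1.0),
        # The paper's quote says "noise parameters" without listing them;
        # these two names are assumptions for illustration.
        "exploration_noise": random.uniform(0.0, 1.0),
        "target_policy_noise": random.uniform(0.0, 1.0),
        "discount": random.uniform(0.9, 1.0),
    }


def pbt_exploit_and_explore(population: list, bottom_frac: float = 0.3,
                            top_frac: float = 0.3) -> None:
    """Rank agents by their last 10 episode returns, replace the bottom fraction
    with copies of randomly sampled top agents, and re-sample the hyperparameters
    of the replaced agents (run at intervals of 100,000 update steps)."""
    ranked = sorted(population, key=lambda a: sum(a.recent_returns[-10:]))
    n_bottom = int(bottom_frac * len(population))
    n_top = int(top_frac * len(population))
    top_agents = ranked[-n_top:]
    for loser in ranked[:n_bottom]:
        winner = random.choice(top_agents)
        loser.params = copy.deepcopy(winner.params)  # exploit: copy weights
        loser.hyperparams = sample_hyperparams()     # explore: new hyperparameters
```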