Distributed Prioritized Experience Replay
Authors: Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, David Silver
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use this distributed architecture to scale up variants of Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), and we evaluate these on the Arcade Learning Environment benchmark (Bellemare et al., 2013), and on a range of continuous control tasks. Our architecture achieves a new state of the art performance on Atari games, using a fraction of the wall-clock time compared to the previous state of the art, and without per-game hyperparameter tuning. We empirically investigate the scalability of our framework, analysing how prioritization affects performance as we increase the number of data-generating workers. Our experiments include an analysis of factors such as the replay capacity, the recency of the experience, and the use of different data-generating policies for different workers. |
| Researcher Affiliation | Industry | Dan Horgan, DeepMind, horgan@google.com; John Quan, DeepMind, johnquan@google.com; David Budden, DeepMind, budden@google.com; Gabriel Barth-Maron, DeepMind, gabrielbm@google.com; Matteo Hessel, DeepMind, mtthss@google.com; Hado van Hasselt, DeepMind, hado@google.com; David Silver, DeepMind, davidsilver@google.com |
| Pseudocode | Yes | Pseudocode for the actors and learners is shown in Algorithms 1 and 2. (A rough sketch of this actor/learner split is given after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or a link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | We use this distributed architecture to scale up variants of Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), and we evaluate these on the Arcade Learning Environment benchmark (Bellemare et al., 2013), and on a range of continuous control tasks. Benchmarking was performed in two continuous control domains ((a) Humanoid and (b) Manipulator, see Figure 8) implemented in the MuJoCo physics simulator (Todorov et al., 2012). |
| Dataset Splits | No | The paper evaluates on the Arcade Learning Environment and DeepMind Control Suite, which are benchmark environments, but does not explicitly provide percentages or counts for training, validation, or test splits of data within the paper itself. It describes training over time and evaluating performance on these environments. |
| Hardware Specification | Yes | Ape-X DQN: 5 days of training, 22800M environment frames, 376 cores, 1 GPU (a Tesla P100). We use 360 actor machines (each using one CPU core) to feed data into the replay memory as fast as they can generate it. |
| Software Dependencies | No | The algorithm is implemented using TensorFlow (Abadi et al., 2016). We use a Centered RMSProp optimizer... Training uses the Adam optimizer (Kingma & Ba, 2014). No specific version numbers for these software components are provided. |
| Experiment Setup | Yes | Each actor i ∈ {0, ..., N-1} executes an ϵ_i-greedy policy where ϵ_i = ϵ^(1 + (i/(N-1))·α) with ϵ = 0.4, α = 7. The episode length is limited to 50000 frames during training. The capacity of the shared experience replay memory is soft-limited to 2 million transitions... Data is sampled according to proportional prioritization, with a priority exponent of 0.6 and an importance sampling exponent set to 0.4. The learner waits for at least 50000 transitions to be accumulated in the replay before starting learning. We use a Centered RMSProp optimizer with a learning rate of 0.00025 / 4, decay of 0.95, epsilon of 1.5e-7, and no momentum to minimize the multi-step loss (with n = 3). Gradient norms are clipped to 40. The target network used in the loss calculation is copied from the online network every 2500 training batches. (A sketch collecting these settings follows the table.) |
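
The pseudocode row refers to Algorithms 1 and 2 in the paper, which are not reproduced here. As a rough single-process sketch of the actor/learner split the quotes describe (many actors generating data with ϵ-greedy policies, a shared prioritized replay, and one learner sampling from it), the following Python outline may help. Every name in it (`ReplayBuffer`, `actor_loop`, `learner_loop`, `policy`, `learner`) and every default not quoted in the table (e.g. the local buffer size and batch size) is an illustrative assumption, not the authors' code.

```python
# Minimal single-process sketch of the actor/learner split described in the
# quotes above. All names (ReplayBuffer, actor_loop, learner_loop, policy,
# learner) are illustrative placeholders, not the authors' code.
import random
import time
from collections import deque


class ReplayBuffer:
    """Shared replay with proportional prioritization (simplified)."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, transitions, priorities):
        self.data.extend(transitions)
        self.priorities.extend(priorities)

    def sample(self, batch_size, alpha=0.6):
        # P(i) proportional to p_i ** alpha, as in proportional prioritized replay.
        weights = [p ** alpha for p in self.priorities]
        total = sum(weights)
        probs = [w / total for w in weights]
        indices = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return indices, [self.data[i] for i in indices]

    def update_priorities(self, indices, new_priorities):
        for i, p in zip(indices, new_priorities):
            self.priorities[i] = p


def actor_loop(env, policy, replay, local_buffer_size=50):
    """One actor: act, buffer transitions locally, then ship them with priorities."""
    local, state = [], env.reset()
    while True:
        action = policy.act(state)                      # epsilon-greedy in the paper
        next_state, reward, done, _ = env.step(action)
        local.append((state, action, reward, next_state, done))
        if len(local) >= local_buffer_size:
            replay.add(local, policy.initial_priorities(local))  # actor-side priorities
            local = []
            policy.sync_params()                        # pull latest learner weights
        state = env.reset() if done else next_state


def learner_loop(learner, replay, batch_size=512, min_replay_size=50_000):
    """One learner: sample prioritized batches, update the network and priorities."""
    while len(replay.data) < min_replay_size:
        time.sleep(0.1)                                 # replay is filled by the actors
    while True:
        indices, batch = replay.sample(batch_size)
        new_priorities = learner.train_step(batch)      # e.g. absolute TD errors
        replay.update_priorities(indices, new_priorities)
```

In the paper's distributed setting the two loops run in separate processes on separate machines, with the replay contents and network parameters communicated over the network; the sketch collapses that into one process for readability.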
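The exploration schedule and hyperparameters quoted in the "Experiment Setup" row can also be collected in one place. This is a sketch under our reading of the reported formula; the function and field names are ours, and the choice of 360 actors for the example comes from the hardware row.

```python
# Per-actor exploration schedule and hyperparameters from the setup row above.
# Variable and field names are ours; the values are those quoted in the table.

def actor_epsilon(i, num_actors, epsilon=0.4, alpha=7):
    """epsilon_i = epsilon ** (1 + alpha * i / (N - 1)) for actor i in {0, ..., N-1}."""
    return epsilon ** (1 + alpha * i / (num_actors - 1))

# Example with 360 actors (as in the hardware row): actor 0 explores the most,
# actor 359 the least.
epsilons = [actor_epsilon(i, 360) for i in range(360)]
assert abs(epsilons[0] - 0.4) < 1e-12            # epsilon_0 = 0.4
assert abs(epsilons[-1] - 0.4 ** 8) < 1e-12      # epsilon_359 = 0.4 ** (1 + 7)

# Remaining settings quoted in the "Experiment Setup" row.
APEX_DQN_CONFIG = dict(
    replay_capacity=2_000_000,         # soft limit on the shared replay
    priority_exponent=0.6,             # proportional prioritization
    importance_sampling_exponent=0.4,
    min_replay_size=50_000,            # learner waits before training
    n_step=3,                          # multi-step loss length
    learning_rate=0.00025 / 4,
    rmsprop_decay=0.95,
    rmsprop_epsilon=1.5e-7,
    rmsprop_momentum=0.0,
    max_gradient_norm=40,
    target_update_interval=2_500,      # training batches between target-network copies
    max_episode_frames=50_000,         # episode length limit during training
)
```

Under this reading, actor 0 keeps ϵ = 0.4 while the last of 360 actors explores with ϵ = 0.4^8 ≈ 6.6 × 10^-4, so the fleet covers a wide range of exploration rates with fixed, untuned values.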