RLlib: Abstractions for Distributed Reinforcement Learning

Authors: Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, Ion Stoica

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Section 5 (Evaluation). Sampling efficiency: Policy evaluation is an important building block for all RL algorithms. In Figure 7 we benchmark the scalability of gathering samples from policy evaluator actors. ... Large-scale tests: We evaluate the performance of RLlib on Evolution Strategies (ES), Proximal Policy Optimization (PPO), and A3C, comparing against specialized systems built specifically for those algorithms (OpenAI, 2017; Hesse et al., 2017; OpenAI, 2016) using Redis, OpenMPI, and Distributed TensorFlow. |
| Researcher Affiliation | Academia | Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica (University of California, Berkeley). Correspondence to: Eric Liang <ericliang@berkeley.edu>. |
| Pseudocode | Yes | Figure 4. Pseudocode for four RLlib policy optimizer step methods. (A hedged sketch of one such step method follows the table.) |
| Open Source Code | Yes | RLlib is available as part of the open source Ray project. |
| Open Datasets | Yes | RLlib supports OpenAI Gym (Brockman et al., 2016), user-defined environments, and also batched simulators such as ELF (Tian et al., 2017). (A minimal user-defined environment example follows the table.) |
| Dataset Splits | No | The paper mentions various environments and tasks (e.g., OpenAI Gym, Humanoid-v1, PongDeterministic-v4), but it does not explicitly state dataset splits (percentages, sample counts, or specific split files) used for training, validation, or testing. |
| Hardware Specification | Yes | RLlib's ES implementation scales well on the Humanoid-v1 task to 8192 cores using AWS m4.16xl CPU instances (Amazon, 2017). For PPO we evaluate on the same Humanoid-v1 task, starting with one p2.16xl GPU instance and adding m4.16xl instances to scale. We ran RLlib's A3C on an x1.16xl machine and solved the PongDeterministic-v4 environment in 12 minutes... |
| Software Dependencies | No | The paper mentions TensorFlow and PyTorch as deep learning frameworks that can be used, and states "We used TensorFlow to define neural networks for the RLlib algorithms evaluated". However, it does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The PPO batch size was 320k, the SGD batch size was 32k, and we used 20 SGD passes per PPO batch (Table 3). Our implementation scales nearly linearly up to 160k environment frames per second with 256 workers at a frameskip of 4 (Section 3.4). Evaluators compute actions for 64 agents at a time, and share the GPUs on the machine (Figure 7 caption). (These numbers are collected into a config sketch after the table.) |
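
For context on the Pseudocode row: Figure 4 of the paper gives pseudocode for four policy optimizer step() methods built on top of policy evaluator replicas. The snippet below is a minimal, self-contained sketch (plain Python and NumPy, no Ray) of what a synchronous step looks like under that abstraction. The class and method names here (ToyEvaluator, SyncLocalOptimizer, compute_gradients, apply_gradients) are illustrative assumptions based on the paper's description, not RLlib's actual API.

```python
import numpy as np


class ToyEvaluator:
    """Stand-in for an RLlib policy evaluator (illustrative only).

    Holds a linear policy's weights, samples a fake batch of experience,
    and computes a toy gradient estimate on it.
    """

    def __init__(self, dim=4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.weights = np.zeros(dim)

    def sample(self, batch_size=32):
        # Pretend to roll out the policy: return fake (obs, advantage) pairs.
        obs = self.rng.normal(size=(batch_size, self.weights.size))
        adv = self.rng.normal(size=batch_size)
        return obs, adv

    def compute_gradients(self, batch):
        obs, adv = batch
        # Toy policy-gradient-style estimate: advantage-weighted mean observation.
        return (obs * adv[:, None]).mean(axis=0)

    def apply_gradients(self, grad, lr=0.01):
        self.weights += lr * grad

    def get_weights(self):
        return self.weights.copy()

    def set_weights(self, w):
        self.weights = w.copy()


class SyncLocalOptimizer:
    """Minimal synchronous policy optimizer step in the spirit of Figure 4."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local = local_evaluator
        self.remotes = remote_evaluators

    def step(self):
        # Broadcast current weights, gather one gradient from each evaluator,
        # average the gradients, and apply the result to the learner copy.
        weights = self.local.get_weights()
        grads = []
        for ev in self.remotes:
            ev.set_weights(weights)
            grads.append(ev.compute_gradients(ev.sample()))
        self.local.apply_gradients(np.mean(grads, axis=0))


if __name__ == "__main__":
    learner = ToyEvaluator(seed=42)
    workers = [ToyEvaluator(seed=i) for i in range(4)]
    opt = SyncLocalOptimizer(learner, workers)
    for _ in range(10):
        opt.step()
    print("learner weights after 10 steps:", learner.get_weights())
```

In RLlib itself the remote evaluators are Ray actors and the gradient exchange happens over Ray's object store; the plain loop above only mirrors the control flow of the step() abstraction.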
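
For the Open Datasets row: RLlib consumes environments through the OpenAI Gym interface, so the "dataset" is typically a simulator exposing reset() and step(). Below is a minimal user-defined environment written against the classic Gym API that was current when the paper appeared (the interface has since changed in newer gym/gymnasium releases). The CorridorEnv name and its dynamics are invented for illustration, and the RLlib registration step is omitted because it varies by version.

```python
import gym
import numpy as np
from gym import spaces


class CorridorEnv(gym.Env):
    """Tiny user-defined environment: walk right along a corridor to the goal."""

    def __init__(self, length=10):
        self.length = length
        self.action_space = spaces.Discrete(2)  # 0 = step left, 1 = step right
        self.observation_space = spaces.Box(
            0.0, float(length), shape=(1,), dtype=np.float32
        )
        self.pos = 0

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= self.length
        reward = 1.0 if done else -0.1
        return np.array([self.pos], dtype=np.float32), reward, done, {}


if __name__ == "__main__":
    env = CorridorEnv()
    obs = env.reset()
    done, total = False, 0.0
    while not done:
        obs, reward, done, _ = env.step(env.action_space.sample())
        total += reward
    print("episode return:", total)
```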
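
For the Experiment Setup row: the quoted PPO numbers translate directly into a small hyperparameter dictionary. The key names below (train_batch_size, sgd_minibatch_size, num_sgd_iter) are assumptions chosen to be descriptive rather than the exact configuration keys of any particular RLlib release; the values are the ones quoted from the paper.

```python
# Hypothetical key names; the values are the ones reported in the paper (Table 3).
ppo_humanoid_config = {
    "train_batch_size": 320_000,   # "The PPO batch size was 320k"
    "sgd_minibatch_size": 32_000,  # "The SGD batch size was 32k"
    "num_sgd_iter": 20,            # "20 SGD passes per PPO batch"
}

# Derived bookkeeping: each PPO batch splits into 320k / 32k = 10 minibatches,
# so 20 passes correspond to 10 * 20 = 200 SGD updates per PPO batch.
minibatches_per_pass = (
    ppo_humanoid_config["train_batch_size"] // ppo_humanoid_config["sgd_minibatch_size"]
)
sgd_updates_per_batch = minibatches_per_pass * ppo_humanoid_config["num_sgd_iter"]
assert (minibatches_per_pass, sgd_updates_per_batch) == (10, 200)

# Throughput note from the quoted Section 3.4 figure: 160,000 environment
# frames per second across 256 workers is 160_000 / 256 = 625 frames/s per worker.
frames_per_worker = 160_000 / 256
assert frames_per_worker == 625
```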