RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

Authors: Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph E. Gonzalez, Ion Stoica

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we re-examine the challenges posed by distributed RL and try to view it through the lens of an old idea: distributed dataflow. We show that viewing RL as a dataflow problem leads to highly composable and performant implementations. We propose RLlib Flow, a hybrid actor-dataflow programming model for distributed RL, and validate its practicality by porting the full suite of algorithms in RLlib, a widely adopted distributed RL library. Concretely, RLlib Flow provides 2-9x code savings in real production code and enables the composition of multi-agent algorithms not possible by end users before. The open-source code is available as part of RLlib at https://github.com/ray-project/ray/tree/master/rllib.
Researcher Affiliation | Collaboration | Eric Liang (UC Berkeley), Zhanghao Wu (UC Berkeley), Michael Luo (UC Berkeley), Sven Mika (Anyscale), Joseph E. Gonzalez (UC Berkeley), Ion Stoica (UC Berkeley)
Pseudocode | Yes | Figure 9: Comparing the implementation of asynchronous optimization in RLlib Flow vs RLlib. (a) The entire A3C dataflow in RLlib Flow. (An illustrative sketch of this dataflow-composition style appears after the table.)
Open Source Code | Yes | The open-source code is available as part of RLlib at https://github.com/ray-project/ray/tree/master/rllib.
Open Datasets | Yes | Sampling Microbenchmark: We evaluate the data throughput of RLlib Flow in isolation by running RL training with a dummy policy (with only one trainable scalar). Figure 13a shows that RLlib Flow achieves slightly better throughput due to small optimizations such as batched RPC wait, which are easy to implement across multiple algorithms in a common way in RLlib Flow. IMPALA Throughput: In Figure 13b we benchmark IMPALA, one of RLlib's high-throughput RL algorithms, and show that RLlib Flow achieves similar or better end-to-end performance. Performance of Multi-Agent Multi-Policy Workflow: In Figure 14, we show that the workflow of the two-trainer example (Figure 11a) achieves close to the theoretical best performance possible combining the two workflows (calculated via Amdahl's law). This benchmark was run in a multi-agent Atari environment with four agents per policy, and shows RLlib Flow can be practically used to compose complex training workflows. Comparison to Spark Streaming: Distributed dataflow systems such as Spark Streaming [30] and Flink [1] are designed for collecting and transforming live data streams from online applications (e.g., event streams, social media). Given the basic map and reduce operations, we can implement synchronous RL algorithms in any of these streaming frameworks. However, without consideration for the requirements of RL tasks (Section 3), these frameworks can introduce significant overheads. In Figure 15 we compare the performance of PPO implemented in Spark Streaming and RLlib Flow. Implementation details are in Appendix A.1. (A toy map-and-reduce sketch of one synchronous iteration appears after the table.)
Dataset Splits | No | The paper mentions using environments like 'CartPole' and 'Atari' but does not specify any train/validation/test splits for these. RL environments typically involve continuous interaction rather than fixed dataset splits.
Hardware Specification | Yes | For all the experiments, we use a cluster with an AWS p3.16xlarge GPU head instance with additional m4.16xlarge worker instances. All machines have 64 vCPUs and are connected by a 25Gbps network.
Software Dependencies | No | The paper mentions software like 'Ray distributed actor framework', 'Apache Storm', 'Apache Flink', and 'Apache Spark', but it does not specify exact version numbers for any of these. Therefore, it does not provide specific ancillary software details for replication.
Experiment Setup | No | The paper discusses the design and implementation of RLlib Flow and how it enables composition of algorithms, mentioning general configuration options like 'batch size' or 'degree of parallelism'. However, it does not provide specific hyperparameter values (e.g., learning rates, specific optimizers, number of training steps/epochs) used for the experiments and benchmarks shown in Section 6.
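
The 'Research Type' and 'Pseudocode' rows quote the paper's hybrid actor-dataflow model, in which a distributed RL algorithm (such as the A3C dataflow of Figure 9) is composed from iterator-style operators chained together. The following is a minimal, self-contained Python sketch of that composition style only; Op, parallel_rollouts, and train_one_step are hypothetical stand-ins, not RLlib Flow's actual operators or signatures.

```python
import random
from typing import Callable, Iterator


class Op:
    """Tiny stand-in for a distributed iterator of items flowing through a dataflow."""

    def __init__(self, source: Iterator):
        self._it = source

    def for_each(self, fn: Callable) -> "Op":
        # Lazily chain a transformation onto every item in the stream.
        return Op(fn(item) for item in self._it)

    def take(self, n: int) -> list:
        # Pull n results out of the composed pipeline.
        return [next(self._it) for _ in range(n)]


def parallel_rollouts(num_workers: int) -> Op:
    """Pretend each rollout worker yields one sample batch (a list of rewards) per step."""
    def gen():
        step = 0
        while True:
            yield {"worker": step % num_workers,            # round-robin over workers
                   "batch": [random.random() for _ in range(4)]}
            step += 1
    return Op(gen())


def train_one_step(item: dict) -> dict:
    # Stand-in for computing and applying a gradient update from one sample batch.
    item["loss"] = sum(item["batch"]) / len(item["batch"])
    return item


if __name__ == "__main__":
    # Compose the training dataflow: parallel rollouts -> per-batch training step.
    train_op = parallel_rollouts(num_workers=2).for_each(train_one_step)
    for result in train_op.take(3):
        print(result["worker"], round(result["loss"], 3))
```

In RLlib Flow itself, the analogous operators run on Ray actors and can be composed synchronously or asynchronously (as in the A3C example); the exact operator names and signatures should be taken from the paper and the RLlib source linked above rather than from this sketch.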
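
The Spark Streaming comparison quoted in the 'Open Datasets' row rests on the observation that a synchronous RL iteration can be expressed with basic map and reduce operations. Below is a toy, self-contained illustration of that claim; rollout, concat, and sgd_update are hypothetical helpers, not code from the paper, Spark Streaming, or RLlib.

```python
from functools import reduce
import random


def rollout(worker_id: int, steps: int = 8) -> list:
    """Map step: each worker collects a sample batch (here just random rewards)."""
    return [random.random() for _ in range(steps)]


def concat(batch_a: list, batch_b: list) -> list:
    """Reduce step: merge per-worker batches into one synchronous training batch."""
    return batch_a + batch_b


def sgd_update(weights: float, batch: list, lr: float = 0.01) -> float:
    """Toy learner update applied once per iteration on the gathered batch."""
    return weights + lr * (sum(batch) / len(batch))


weights = 0.0
for _ in range(3):                                 # three synchronous training iterations
    batches = map(rollout, range(4))               # map: four parallel rollout "workers"
    train_batch = reduce(concat, batches)          # reduce: gather all batches at the learner
    weights = sgd_update(weights, train_batch)     # apply one update, then repeat
print(round(weights, 4))
```

As the quoted passage notes, such an implementation is possible in a general-purpose streaming engine, but without RL-specific operators it incurs the overheads measured in Figure 15.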