RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

Authors: Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph E. Gonzalez, Ion Stoica

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we re-examine the challenges posed by distributed RL and try to view it through the lens of an old idea: distributed dataflow. We show that viewing RL as a dataflow problem leads to highly composable and performant implementations. We propose RLlib Flow, a hybrid actor-dataflow programming model for distributed RL, and validate its practicality by porting the full suite of algorithms in RLlib, a widely adopted distributed RL library. Concretely, RLlib Flow provides 2-9x code savings in real production code and enables the composition of multi-agent algorithms not possible by end users before. The open-source code is available as part of RLlib at https://github.com/ray-project/ray/tree/master/rllib.
Researcher Affiliation | Collaboration | Eric Liang (UC Berkeley), Zhanghao Wu (UC Berkeley), Michael Luo (UC Berkeley), Sven Mika (Anyscale), Joseph E. Gonzalez (UC Berkeley), Ion Stoica (UC Berkeley)
Pseudocode | Yes | Figure 9: Comparing the implementation of asynchronous optimization in RLlib Flow vs RLlib. (a) The entire A3C dataflow in RLlib Flow. (An illustrative sketch of this dataflow-composition style appears after the table.)
Open Source Code | Yes | The open-source code is available as part of RLlib at https://github.com/ray-project/ray/tree/master/rllib.
Open Datasets | Yes | Sampling Microbenchmark: We evaluate the data throughput of RLlib Flow in isolation by running RL training with a dummy policy (with only one trainable scalar). Figure 13a shows that RLlib Flow achieves slightly better throughput due to small optimizations such as batched RPC wait, which are easy to implement across multiple algorithms in a common way in RLlib Flow. IMPALA Throughput: In Figure 13b we benchmark IMPALA, one of RLlib's high-throughput RL algorithms, and show that RLlib Flow achieves similar or better end-to-end performance. Performance of Multi-Agent Multi-Policy Workflow: In Figure 14, we show that the workflow of the two-trainer example (Figure 11a) achieves close to the theoretical best performance possible combining the two workflows (calculated via Amdahl's law). This benchmark was run in a multi-agent Atari environment with four agents per policy, and shows RLlib Flow can be practically used to compose complex training workflows. Comparison to Spark Streaming: Distributed dataflow systems such as Spark Streaming [30] and Flink [1] are designed for collecting and transforming live data streams from online applications (e.g., event streams, social media). Given the basic map and reduce operations, we can implement synchronous RL algorithms in any of these streaming frameworks. However, without consideration for the requirements of RL tasks (Section 3), these frameworks can introduce significant overheads. In Figure 15 we compare the performance of PPO implemented in Spark Streaming and RLlib Flow. Implementation details are in Appendix A.1. (A toy map-and-reduce sketch of one synchronous iteration appears after the table.)
Dataset Splits | No | The paper mentions using environments like 'CartPole' and 'Atari' but does not specify any train/validation/test splits for these. RL environments typically involve continuous interaction rather than fixed dataset splits.
Hardware Specification | Yes | For all the experiments, we use a cluster with an AWS p3.16xlarge GPU head instance with additional m4.16xlarge worker instances. All machines have 64 vCPUs and are connected by a 25Gbps network.
Software Dependencies | No | The paper mentions software like 'Ray distributed actor framework', 'Apache Storm', 'Apache Flink', and 'Apache Spark', but it does not specify exact version numbers for any of these. Therefore, it does not provide specific ancillary software details for replication.
Experiment Setup | No | The paper discusses the design and implementation of RLlib Flow and how it enables composition of algorithms, mentioning general configuration options like 'batch size' or 'degree of parallelism'. However, it does not provide specific hyperparameter values (e.g., learning rates, specific optimizers, number of training steps/epochs) used for the experiments and benchmarks shown in Section 6.
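
The 'Research Type' and 'Pseudocode' rows quote the paper's hybrid actor-dataflow model, in which a distributed RL algorithm (such as the A3C dataflow of Figure 9) is composed from iterator-style operators chained together. The following is a minimal, self-contained Python sketch of that composition style only; Op, parallel_rollouts, and train_one_step are hypothetical stand-ins, not RLlib Flow's actual operators or signatures.

```python
import random
from typing import Callable, Iterator


class Op:
    """Tiny stand-in for a distributed iterator of items flowing through a dataflow."""

    def __init__(self, source: Iterator):
        self._it = source

    def for_each(self, fn: Callable) -> "Op":
        # Lazily chain a transformation onto every item in the stream.
        return Op(fn(item) for item in self._it)

    def take(self, n: int) -> list:
        # Pull n results out of the composed pipeline.
        return [next(self._it) for _ in range(n)]


def parallel_rollouts(num_workers: int) -> Op:
    """Pretend each rollout worker yields one sample batch (a list of rewards) per step."""
    def gen():
        step = 0
        while True:
            yield {"worker": step % num_workers,            # round-robin over workers
                   "batch": [random.random() for _ in range(4)]}
            step += 1
    return Op(gen())


def train_one_step(item: dict) -> dict:
    # Stand-in for computing and applying a gradient update from one sample batch.
    item["loss"] = sum(item["batch"]) / len(item["batch"])
    return item


if __name__ == "__main__":
    # Compose the training dataflow: parallel rollouts -> per-batch training step.
    train_op = parallel_rollouts(num_workers=2).for_each(train_one_step)
    for result in train_op.take(3):
        print(result["worker"], round(result["loss"], 3))
```

In RLlib Flow itself, the analogous operators run on Ray actors and can be composed synchronously or asynchronously (as in the A3C example); the exact operator names and signatures should be taken from the paper and the RLlib source linked above rather than from this sketch.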
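
The Spark Streaming comparison quoted in the 'Open Datasets' row rests on the observation that a synchronous RL iteration can be expressed with basic map and reduce operations. Below is a toy, self-contained illustration of that claim; rollout, concat, and sgd_update are hypothetical helpers, not code from the paper, Spark Streaming, or RLlib.

```python
from functools import reduce
import random


def rollout(worker_id: int, steps: int = 8) -> list:
    """Map step: each worker collects a sample batch (here just random rewards)."""
    return [random.random() for _ in range(steps)]


def concat(batch_a: list, batch_b: list) -> list:
    """Reduce step: merge per-worker batches into one synchronous training batch."""
    return batch_a + batch_b


def sgd_update(weights: float, batch: list, lr: float = 0.01) -> float:
    """Toy learner update applied once per iteration on the gathered batch."""
    return weights + lr * (sum(batch) / len(batch))


weights = 0.0
for _ in range(3):                                 # three synchronous training iterations
    batches = map(rollout, range(4))               # map: four parallel rollout "workers"
    train_batch = reduce(concat, batches)          # reduce: gather all batches at the learner
    weights = sgd_update(weights, train_batch)     # apply one update, then repeat
print(round(weights, 4))
```

As the quoted passage notes, such an implementation is possible in a general-purpose streaming engine, but without RL-specific operators it incurs the overheads measured in Figure 15.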