Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform
Authors: Shengyi Huang, Jiayi Weng, Rujikorn Charakorn, Min Lin, Zhongwen Xu, Santiago Ontañón
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Atari experiments show that these variants can obtain equivalent or higher scores than strong IMPALA baselines in moolib and torchbeast and the PPO baseline in CleanRL. Moreover, Cleanba variants exhibit 1) shorter training time and 2) more reproducible learning curves across different hardware settings. |
| Researcher Affiliation | Collaboration | Shengyi Huang (Drexel University, Hugging Face); Google; VISTEC; Sea AI Lab; Tencent AI Lab. Contact: costa.huang@outlook.com |
| Pseudocode | Yes | Figure 1: The pseudocode for IMPALA's architecture (left) and Cleanba's architecture (right). (An illustrative sketch of the actor-learner hand-off follows the table.) |
| Open Source Code | Yes | Cleanba's source code is available at https://github.com/vwxyzjn/cleanba. |
| Open Datasets | Yes | We perform experiments on Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper states experiments ran for 200M frames with three random seeds on Atari games, but does not explicitly provide specific train/validation/test dataset split percentages or sample counts. |
| Hardware Specification | Yes | To make a more direct and fair comparison, we used the same AWS p4d.24xlarge instances and the same Atari environment simulation setups via EnvPool and compared only the following codebase settings... 1. Base experiments use 10 CPUs and 1 A100 as a base comparison; 2. Workstation experiments use 46 CPUs and 8 A100s for Cleanba experiments, 80 CPUs and 8 A100s for moolib experiments, and 80 CPUs and 1 A100 for monobeast experiments. |
| Software Dependencies | No | The paper states that 'Cleanba's implementation uses JAX (Bradbury et al., 2018) and EnvPool (Weng et al., 2022)', and that 'The dependencies of the experiments are pinned', but it does not explicitly list specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | All experiments used 84×84 greyscale images, an action repeat of 4, 4 stacked frames, and a maximum of 108,000 frames per episode. We followed the recommended Atari evaluation protocol of Machado et al. (2018), which uses sticky actions with a probability of 25%, no loss-of-life signal, and the full action space... Throughout all experiments, the agents used IMPALA's ResNet architecture (Espeholt et al., 2018) and ran for 200M frames with three random seeds. The hyperparameters and the learning curves can be found in Appendix B. (An illustrative EnvPool setup follows the table.) |
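The Pseudocode row references the paper's Figure 1, which contrasts IMPALA's asynchronous architecture with Cleanba's. The following is a minimal, illustrative Python sketch of the kind of synchronous actor-learner hand-off that Figure 1 describes: the actor only rolls out under parameters it explicitly received from the learner, so every rollout is pinned to a known parameter version. This is not Cleanba's actual code (which uses JAX and vectorized EnvPool environments); `collect_rollout` and the queue names are hypothetical stand-ins.

```python
import queue
import threading

NUM_UPDATES = 5

def collect_rollout(params_version):
    """Placeholder for a vectorized environment rollout under a fixed policy."""
    return {"params_version": params_version}

def actor(param_q, rollout_q):
    # Actor: block until the learner publishes parameters, roll out once,
    # then hand the data back. Pinning each rollout to a known parameter
    # version is what makes the data flow deterministic.
    while True:
        version = param_q.get()
        if version is None:  # shutdown signal
            return
        rollout_q.put(collect_rollout(version))

param_q, rollout_q = queue.Queue(), queue.Queue()
thread = threading.Thread(target=actor, args=(param_q, rollout_q))
thread.start()

version = 0
param_q.put(version)
for _ in range(NUM_UPDATES):
    rollout = rollout_q.get()  # wait for the actor's data
    version += 1               # stand-in for one gradient update
    param_q.put(version)       # publish fresh parameters
    print(f"update {version} used rollout from version {rollout['params_version']}")

param_q.put(None)              # stop the actor
rollout_q.get()                # drain the final rollout
thread.join()
```

In an asynchronous design like IMPALA's, the actor would keep rolling out with whatever stale parameters it holds, so which policy version generated a given batch depends on hardware timing; the blocking queues above remove that source of nondeterminism at the cost of some idle time.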
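The Experiment Setup row can be mapped onto EnvPool's Atari configuration. Below is a minimal sketch of that mapping; the keyword arguments follow EnvPool's documented Atari options as we understand them (verify names and defaults against your installed EnvPool version), and `Breakout-v5` with `num_envs=8` are arbitrary illustrative choices, not the paper's settings.

```python
import envpool  # pip install envpool

# Machado et al. (2018) protocol as quoted above: 84x84 greyscale frames,
# action repeat of 4, 4 stacked frames, sticky actions with probability
# 0.25, no loss-of-life signal, the full action space, and at most
# 108,000 raw frames per episode (27,000 agent steps at frame skip 4).
envs = envpool.make(
    "Breakout-v5",                   # illustrative game choice
    env_type="gym",
    num_envs=8,                      # illustrative degree of parallelism
    img_height=84,
    img_width=84,
    gray_scale=True,
    frame_skip=4,                    # action repeat of 4
    stack_num=4,                     # 4 stacked frames
    repeat_action_probability=0.25,  # sticky actions
    episodic_life=False,             # no loss-of-life signal
    full_action_space=True,          # full 18-action space
    max_episode_steps=27_000,        # 108,000 frames / frame skip of 4
)

obs = envs.reset()                   # newer gym APIs return (obs, info) instead
print(obs.shape)                     # expected: (8, 4, 84, 84)
```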