Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform
Authors: Shengyi Huang, Jiayi Weng, Rujikorn Charakorn, Min Lin, Zhongwen Xu, Santiago Ontañón
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Atari experiments show that these variants can obtain equivalent or higher scores than strong IMPALA baselines in moolib and torchbeast and the PPO baseline in CleanRL. Moreover, Cleanba variants exhibit 1) shorter training time and 2) more reproducible learning curves across different hardware settings. |
| Researcher Affiliation | Collaboration | Shengyi Huang (Drexel University, Hugging Face); Google; VISTEC; Sea AI Lab; Tencent AI Lab. Contact: costa.huang@outlook.com |
| Pseudocode | Yes | Figure 1: The pseudocode for IMPALA's architecture (left) and Cleanba's architecture (right). (An illustrative sketch of the actor-learner hand-off follows the table.) |
| Open Source Code | Yes | Cleanba's source code is available at https://github.com/vwxyzjn/cleanba. |
| Open Datasets | Yes | We perform experiments on Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper states experiments ran for 200M frames with three random seeds on Atari games, but does not explicitly provide specific train/validation/test dataset split percentages or sample counts. |
| Hardware Specification | Yes | To make a more direct and fair comparison, we used the same AWS p4d.24xlarge instances and the same Atari environment simulation setups via EnvPool and compared only the following codebase settings... 1. Base experiments use 10 CPUs and 1 A100 as a base comparison; 2. Workstation experiments use 46 CPUs and 8 A100s for Cleanba experiments, 80 CPUs and 8 A100s for moolib experiments, and 80 CPUs and 1 A100 for monobeast experiments. |
| Software Dependencies | No | The paper states that 'Cleanba's implementation uses JAX (Bradbury et al., 2018) and EnvPool (Weng et al., 2022)', and that 'The dependencies of the experiments are pinned', but it does not explicitly list specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | All experiments used 84×84 greyscale images, an action repeat of 4, 4 stacked frames, and a maximum of 108,000 frames per episode. We followed the recommended Atari evaluation protocol of Machado et al. (2018), which uses sticky actions with a probability of 25%, no loss-of-life signal, and the full action space... Throughout all experiments, the agents used IMPALA's ResNet architecture (Espeholt et al., 2018) and ran for 200M frames with three random seeds. The hyperparameters and the learning curves can be found in Appendix B. (An illustrative EnvPool setup follows the table.) |
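The Pseudocode row references the paper's Figure 1, which contrasts IMPALA's asynchronous architecture with Cleanba's. The following is a minimal, illustrative Python sketch of the kind of synchronous actor-learner hand-off that Figure 1 describes: the actor only rolls out under parameters it explicitly received from the learner, so every rollout is pinned to a known parameter version. This is not Cleanba's actual code (which uses JAX and vectorized EnvPool environments); `collect_rollout` and the queue names are hypothetical stand-ins.

```python
import queue
import threading

NUM_UPDATES = 5

def collect_rollout(params_version):
    """Placeholder for a vectorized environment rollout under a fixed policy."""
    return {"params_version": params_version}

def actor(param_q, rollout_q):
    # Actor: block until the learner publishes parameters, roll out once,
    # then hand the data back. Pinning each rollout to a known parameter
    # version is what makes the data flow deterministic.
    while True:
        version = param_q.get()
        if version is None:  # shutdown signal
            return
        rollout_q.put(collect_rollout(version))

param_q, rollout_q = queue.Queue(), queue.Queue()
thread = threading.Thread(target=actor, args=(param_q, rollout_q))
thread.start()

version = 0
param_q.put(version)
for _ in range(NUM_UPDATES):
    rollout = rollout_q.get()  # wait for the actor's data
    version += 1               # stand-in for one gradient update
    param_q.put(version)       # publish fresh parameters
    print(f"update {version} used rollout from version {rollout['params_version']}")

param_q.put(None)              # stop the actor
rollout_q.get()                # drain the final rollout
thread.join()
```

In an asynchronous design like IMPALA's, the actor would keep rolling out with whatever stale parameters it holds, so which policy version generated a given batch depends on hardware timing; the blocking queues above remove that source of nondeterminism at the cost of some idle time.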
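The Experiment Setup row can be mapped onto EnvPool's Atari configuration. Below is a minimal sketch of that mapping; the keyword arguments follow EnvPool's documented Atari options as we understand them (verify names and defaults against your installed EnvPool version), and `Breakout-v5` with `num_envs=8` are arbitrary illustrative choices, not the paper's settings.

```python
import envpool  # pip install envpool

# Machado et al. (2018) protocol as quoted above: 84x84 greyscale frames,
# action repeat of 4, 4 stacked frames, sticky actions with probability
# 0.25, no loss-of-life signal, the full action space, and at most
# 108,000 raw frames per episode (27,000 agent steps at frame skip 4).
envs = envpool.make(
    "Breakout-v5",                   # illustrative game choice
    env_type="gym",
    num_envs=8,                      # illustrative degree of parallelism
    img_height=84,
    img_width=84,
    gray_scale=True,
    frame_skip=4,                    # action repeat of 4
    stack_num=4,                     # 4 stacked frames
    repeat_action_probability=0.25,  # sticky actions
    episodic_life=False,             # no loss-of-life signal
    full_action_space=True,          # full 18-action space
    max_episode_steps=27_000,        # 108,000 frames / frame skip of 4
)

obs = envs.reset()                   # newer gym APIs return (obs, info) instead
print(obs.shape)                     # expected: (8, 4, 84, 84)
```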