Parallelizing Model-based Reinforcement Learning Over the Sequence Length
Authors: Zirui Wang, Yue Deng, Junfeng Long, Yin Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical results demonstrate that the PWM and PETE within PaMoRL significantly increase training speed without sacrificing inference efficiency. We evaluated our PaMoRL framework on the Atari 100K benchmark [16] and the DeepMind Control Suite [24]. The summarized experimental results are shown in Figure 1. |
| Researcher Affiliation | Collaboration | Zirui Wang, Zhejiang University, China (ziseoiwong@zju.edu.cn); Yue Deng, Zhejiang University, China (devindeng@zju.edu.cn); Junfeng Long, Shanghai AI Laboratory, China (junfengac@gmail.com); Yin Zhang, Zhejiang University, China (zhangyin98@zju.edu.cn) |
| Pseudocode | Yes | Appendix G: PyTorch-style pseudo-code of parallel scan (G.1 odd-even scanner; G.2 Kogge-Stone scanner). Appendix H: PyTorch-style pseudo-code of parallelized eligibility trace estimation. |
| Open Source Code | Yes | We provide sufficient information about the hyper-parameters as well as the details in the Appendix. We also pack our code in the supplemental materials. |
| Open Datasets | Yes | We evaluated our PaMoRL framework on the Atari 100K benchmark [16] and the DeepMind Control Suite [24]. |
| Dataset Splits | No | The paper specifies training sample budgets for Atari 100K and the DeepMind Control Suite (e.g., '100K samples', '50K training samples'), but does not explicitly describe train/validation/test dataset splits with percentages or counts for reproduction. It mentions 'training samples' but not how the data is split for validation or testing. |
| Hardware Specification | Yes | Among these methods, DreamerV3 [17] and our PaMoRL are evaluated directly on an NVIDIA V100 GPU; IRIS [20], TWM [21], and REM [25] are evaluated on an A100 GPU, while other methods are evaluated on a P100 GPU. Figure 5 shows PWM's and PETE's runtime and GPU memory utilization on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper provides 'PyTorch-style pseudo-code', implying the use of PyTorch, and mentions 'Adam' as an optimizer. However, it does not specify concrete version numbers for PyTorch or any other software libraries or dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | Table 8: Full hyperparameters. Note that the environment provides a done signal when a life is lost but continues running until the actual reset occurs; this life-information configuration aligns with the setup used in IRIS [20]. Regarding data sampling, each iteration we sample B1 trajectories of length T for world-model training and B2 trajectories of length C to start the imagination rollouts. |
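The Kogge-Stone scanner referenced in Appendix G can be illustrated with a short PyTorch snippet. The sketch below is not the paper's code (which is in the supplemental materials); it is a minimal, assumed implementation of the general pattern: a first-order linear recurrence y[t] = a[t]·y[t-1] + b[t], which covers both plain prefix sums (a = 1) and discounted accumulations of the kind eligibility-trace estimation needs. The function name `kogge_stone_linear_scan` is hypothetical.

```python
import torch

def kogge_stone_linear_scan(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Compute y[t] = a[t] * y[t-1] + b[t] with y[-1] = 0, in parallel.

    Kogge-Stone pattern: each position t combines with the partial result
    at t - d for strides d = 1, 2, 4, ..., so the O(T) sequential loop
    collapses into O(log T) vectorized steps. Inputs have time-first
    shape (T, ...).
    """
    A, B = a.clone(), b.clone()
    T = A.shape[0]
    d = 1
    while d < T:
        # Shifted copies act as the identity transform (A=1, B=0)
        # for the first d positions, which have no left neighbor.
        A_shift = torch.ones_like(A)
        B_shift = torch.zeros_like(B)
        A_shift[d:] = A[:-d]
        B_shift[d:] = B[:-d]
        # Compose transforms: later(earlier(y)) = A*(A'*y + B') + B.
        # The tuple RHS is evaluated with the old A, as required.
        A, B = A * A_shift, A * B_shift + B
        d *= 2
    return B
```

For example, with a uniform discount a[t] = 0.5 and rewards b[t] = 1, the scan returns [1.0, 1.5, 1.75], matching the sequential recurrence. The odd-even scanner in Appendix G.1 computes the same result with a different work/step trade-off (fewer total operations, more steps).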