Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators
Authors: Paavo Parmas, Takuma Seno, Yuma Aoki
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments in two settings: we replicate the cart-pole task from the original TP paper (PIPPS, App. B), and we combine TPX with the state-of-the-art visual MBRL algorithm Dreamer (Sec. 6) (Hafner et al., 2020) and apply it to continuous control from pixels in the DMC environments (Tunyasuvunakool et al., 2020). The results showed that TPX can reliably match TP while requiring less computation and being easier to implement. |
| Researcher Affiliation | Collaboration | 1 Kyoto University, 2 Sony AI, 3 University of Tokyo. Correspondence to: Paavo Parmas <paavo@sys.i.kyoto-u.ac.jp>. |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states 'Our implementation is a greatly modified version of https://github.com/yusukeurakami/dreamer-pytorch.' but does not explicitly state that the code for their proposed method (TPX and its integration) is open-source or provide a direct link to their modified code. |
| Open Datasets | Yes | We evaluated the algorithms on eight continuous control DeepMind Control Suite (Tunyasuvunakool et al., 2020) environments using the MuJoCo simulator: Cartpole Swingup, Cartpole Swingup Sparse, Cheetah Run, Cartpole Balance, Walker Walk, Finger Spin and Reacher Easy. |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits (percentages, sample counts, or references to predefined splits) needed to reproduce the data partitioning. It mentions sampling batches for training but not formal splits. |
| Hardware Specification | Yes | The standard experiments were run on a mix of Nvidia RTX 2080 Ti and V100 GPUs, while the pretrained experiments were run on a uniform setup of V100 GPUs. |
| Software Dependencies | No | The paper mentions software like 'PyTorch', 'TensorFlow 2', and the 'MuJoCo simulator' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We keep the hyperparameters and all settings the same as in the original Dreamer, but only swap the RP gradient estimators in the Dreamer policy gradient estimator to use TPX to test whether our algorithm can scale, and whether it adds any benefit. The algorithm samples, from the dataset, a batch of 50 sequences of 50 images (2500 data points in total)... The trajectories start from the 2500 encoded states, and they have length H. We perform experiments for simulation horizons H ∈ {15, 60}. We evaluate each experiment with 4 random seeds. |
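To make the batch and rollout dimensions in the experiment setup concrete, below is a minimal PyTorch sketch of the plumbing described above: a replay batch of 50 sequences of 50 frames giving 2,500 encoded start states, imagined rollouts of length H ∈ {15, 60}, and 4 random seeds. The encoder, dynamics, and reward stand-ins are hypothetical placeholders for illustration only; they are not the authors' Dreamer/TPX implementation, which is not released.

```python
# Minimal sketch of the rollout/batch setup described in the table above,
# using hypothetical stand-ins for Dreamer's world model, policy, and reward
# head (none of these names come from the paper's code).
import torch

BATCH_SEQS, SEQ_LEN, LATENT_DIM = 50, 50, 30   # 50 x 50 = 2500 start states
HORIZONS, SEEDS = (15, 60), (0, 1, 2, 3)       # H in {15, 60}, 4 random seeds

def encode(obs):
    """Placeholder encoder: maps an image batch to latent start states."""
    return torch.randn(obs.shape[0] * obs.shape[1], LATENT_DIM)

def imagine(start_states, horizon, policy):
    """Roll imagined latent trajectories of length `horizon` from each start state."""
    states, rewards = [start_states], []
    for _ in range(horizon):
        action = policy(states[-1])
        next_state = states[-1] + 0.1 * action          # stand-in dynamics
        rewards.append(next_state.pow(2).mean(dim=-1))  # stand-in reward head
        states.append(next_state)
    return torch.stack(rewards)                          # shape: (horizon, 2500)

for horizon in HORIZONS:
    for seed in SEEDS:
        torch.manual_seed(seed)
        policy = torch.nn.Linear(LATENT_DIM, LATENT_DIM)
        obs = torch.randn(BATCH_SEQS, SEQ_LEN, 3, 64, 64)  # sampled replay batch
        start_states = encode(obs)
        returns = imagine(start_states, horizon, policy).sum(dim=0)
        # Dreamer backpropagates through the imagined rollout (the RP estimator);
        # per the paper, this is the point where the estimator is swapped for TPX.
        loss = -returns.mean()
        loss.backward()
```

The sketch only shows where the gradient estimator sits in the training loop; all other hyperparameters are kept identical to the original Dreamer, as the paper states.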