Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators

Authors: Paavo Parmas, Takuma Seno, Yuma Aoki

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments in two settings: we replicate the cart-pole task from the original TP paper (PIPPS, App. B), and we combine TPX with the state-of-the-art visual MBRL algorithm Dreamer (Sec. 6; Hafner et al., 2020) and apply it to continuous control from pixels in the DMC environments (Tunyasuvunakool et al., 2020). The results showed that TPX can reliably match TP while requiring less computation and being easier to implement.
Researcher Affiliation | Collaboration | Kyoto University, Sony AI, University of Tokyo. Correspondence to: Paavo Parmas <paavo@sys.i.kyoto-u.ac.jp>.
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block.
Open Source Code | No | The paper states 'Our implementation is a greatly modified version of https://github.com/yusukeurakami/dreamer-pytorch', but it does not explicitly state that the code for the proposed method (TPX and its integration) is open source, nor does it provide a direct link to the modified code.
Open Datasets | Yes | We evaluated the algorithms on eight continuous control DeepMind Control Suite (Tunyasuvunakool et al., 2020) environments using the MuJoCo simulator: Cartpole Swingup, Cartpole Swingup Sparse, Cheetah Run, Cartpole Balance, Walker Walk, Finger Spin and Reacher Easy.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits (percentages, sample counts, or references to predefined splits) needed to reproduce the data partitioning. It mentions sampling batches for training but not formal splits.
Hardware Specification | Yes | The standard experiments were run on a mix of Nvidia RTX 2080 Ti and V100 GPUs, while the pretrained experiments were run on a uniform setup of V100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, TensorFlow 2, and the MuJoCo simulator, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We keep the hyperparameters and all settings the same as in the original Dreamer, but only swap the RP gradient estimators in the Dreamer policy gradient estimator to use TPX to test whether our algorithm can scale, and whether it adds any benefit. The algorithm samples, from the dataset, a batch of 50 sequences of 50 images (2500 data points in total)... The trajectories start from the 2500 encoded states, and they have length H. We perform experiments for simulation horizons H ∈ {15, 60}. We evaluate each experiment with 4 random seeds.
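
The Experiment Setup and Open Datasets rows together describe a small evaluation grid: batches of 50 sequences of 50 images, simulation horizons H ∈ {15, 60}, 4 random seeds, and the listed DMC tasks. Below is a minimal sketch of that grid, assuming hypothetical names (DMC_TASKS, CONFIG, enumerate_runs) that are not part of the authors' code.

```python
# Hypothetical sketch of the reported evaluation grid; not the authors' implementation.
# Values are taken from the quoted setup: batches of 50 sequences of 50 images,
# imagination horizons H in {15, 60}, and 4 random seeds per experiment.
from itertools import product

DMC_TASKS = [                 # DeepMind Control Suite tasks listed in the paper
    "cartpole_swingup",
    "cartpole_swingup_sparse",
    "cheetah_run",
    "cartpole_balance",
    "walker_walk",
    "finger_spin",
    "reacher_easy",
]

CONFIG = {
    "batch_sequences": 50,              # sequences sampled from the dataset per batch
    "sequence_length": 50,              # images per sequence (2500 data points in total)
    "imagination_horizons": [15, 60],   # simulation horizons H
    "num_seeds": 4,                     # random seeds per experiment
}

def enumerate_runs():
    """Yield one run specification per (task, horizon, seed) combination."""
    for task, horizon, seed in product(
        DMC_TASKS, CONFIG["imagination_horizons"], range(CONFIG["num_seeds"])
    ):
        yield {"task": task, "horizon": horizon, "seed": seed}

if __name__ == "__main__":
    for run in enumerate_runs():
        print(run)  # placeholder: here one would launch a Dreamer training run with TPX
```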