Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators

Authors: Paavo Parmas, Takuma Seno, Yuma Aoki

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments in two settings: we replicate the cart-pole task from the original TP paper (PIPPS, App. B), and we combine TPX with the state-of-the-art visual MBRL algorithm Dreamer (Sec. 6; Hafner et al., 2020) and apply it to continuous control from pixels in the DMC environments (Tunyasuvunakool et al., 2020). The results showed that TPX can reliably match TP while requiring less computation and being easier to implement.
Researcher Affiliation | Collaboration | Kyoto University, Sony AI, University of Tokyo. Correspondence to: Paavo Parmas <paavo@sys.i.kyoto-u.ac.jp>.
Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block.
Open Source Code | No | The paper states 'Our implementation is a greatly modified version of https://github.com/yusukeurakami/dreamer-pytorch', but it does not explicitly state that the code for the proposed method (TPX and its integration) is open source, nor does it provide a direct link to the modified code.
Open Datasets | Yes | We evaluated the algorithms on eight continuous control DeepMind Control Suite (Tunyasuvunakool et al., 2020) environments using the MuJoCo simulator: Cartpole Swingup, Cartpole Swingup Sparse, Cheetah Run, Cartpole Balance, Walker Walk, Finger Spin and Reacher Easy.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits (percentages, sample counts, or references to predefined splits) needed to reproduce the data partitioning. It mentions sampling batches for training but not formal splits.
Hardware Specification | Yes | The standard experiments were run on a mix of Nvidia RTX 2080 Ti and V100 GPUs, while the pretrained experiments were run on a uniform setup of V100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, TensorFlow 2, and the MuJoCo simulator, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We keep the hyperparameters and all settings the same as in the original Dreamer, but only swap the RP gradient estimators in the Dreamer policy gradient estimator to use TPX to test whether our algorithm can scale, and whether it adds any benefit. The algorithm samples, from the dataset, a batch of 50 sequences of 50 images (2500 data points in total)... The trajectories start from the 2500 encoded states, and they have length H. We perform experiments for simulation horizons H ∈ {15, 60}. We evaluate each experiment with 4 random seeds.
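
The Experiment Setup and Open Datasets rows together describe a small evaluation grid: batches of 50 sequences of 50 images, simulation horizons H ∈ {15, 60}, 4 random seeds, and the listed DMC tasks. Below is a minimal sketch of that grid, assuming hypothetical names (DMC_TASKS, CONFIG, enumerate_runs) that are not part of the authors' code.

```python
# Hypothetical sketch of the reported evaluation grid; not the authors' implementation.
# Values are taken from the quoted setup: batches of 50 sequences of 50 images,
# imagination horizons H in {15, 60}, and 4 random seeds per experiment.
from itertools import product

DMC_TASKS = [                 # DeepMind Control Suite tasks listed in the paper
    "cartpole_swingup",
    "cartpole_swingup_sparse",
    "cheetah_run",
    "cartpole_balance",
    "walker_walk",
    "finger_spin",
    "reacher_easy",
]

CONFIG = {
    "batch_sequences": 50,              # sequences sampled from the dataset per batch
    "sequence_length": 50,              # images per sequence (2500 data points in total)
    "imagination_horizons": [15, 60],   # simulation horizons H
    "num_seeds": 4,                     # random seeds per experiment
}

def enumerate_runs():
    """Yield one run specification per (task, horizon, seed) combination."""
    for task, horizon, seed in product(
        DMC_TASKS, CONFIG["imagination_horizons"], range(CONFIG["num_seeds"])
    ):
        yield {"task": task, "horizon": horizon, "seed": seed}

if __name__ == "__main__":
    for run in enumerate_runs():
        print(run)  # placeholder: here one would launch a Dreamer training run with TPX
```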