Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Authors: Xiyao Wang, Wichayaporn Wongkamjan, Ruonan Jia, Furong Huang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods."
Researcher Affiliation | Academia | "1 Department of Computer Science, University of Maryland, College Park, MD 20742, USA; 2 Tsinghua University."
Pseudocode | Yes | "Algorithm 1 Policy-adapted Dynamics Model Learning (PDML)"
Open Source Code | No | "We implement PDML-MBPO based on the PyTorch-version MBPO (Liu et al., 2020)." The paper states that the method is built on an existing open-source implementation, but it does not explicitly state that the PDML-MBPO code itself is open source or provide a link to it.
Open Datasets | Yes | "We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments..."
Dataset Splits | No | The paper mentions collecting 'real samples' and using an 'evaluation dataset' of 1000 N samples for error calculation, but it does not specify how the data is split into explicit training, validation, and test sets with percentages or counts for reproducibility.
Hardware Specification | Yes | "All experiments are conducted using a single NVIDIA TITAN X Pascal GPU."
Software Dependencies | Yes | "We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments..."
Experiment Setup | Yes | "We set the current policy proportion to be 0.02 and α equals 0.02/0.98. One thing that needs to be noted is the rollout horizon setting. As introduced in MBPO (Janner et al., 2019), the rollout horizon should start at a short horizon and increase linearly with the interaction epoch. [a, b, x, y] denotes a thresholded linear function, i.e., at epoch e, the rollout horizon is h = min(max(x + ((e − a)/(b − a))(y − x), x), y). We set the rollout horizon to be the same as used in the MBPO paper, as shown in Table 5. Other hyper-parameter settings are shown in Table 6."
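
To make the thresholded linear rollout-horizon schedule quoted above concrete, here is a minimal Python sketch of that formula. The function name and the example schedule values are illustrative assumptions, not taken from Table 5 of the paper.

# Minimal sketch of the thresholded linear rollout-horizon schedule described
# above: a schedule [a, b, x, y] means the horizon grows linearly from x to y
# between epochs a and b, and is clipped to [x, y] outside that range.
def rollout_horizon(epoch, a, b, x, y):
    """Rollout horizon h at the given interaction epoch (hypothetical helper)."""
    h = x + (epoch - a) / (b - a) * (y - x)  # linear interpolation in epoch
    return int(min(max(h, x), y))            # threshold at both ends

# Usage with a placeholder schedule [a, b, x, y] = [20, 150, 1, 15]
# (illustrative values, not the ones reported in Table 5):
for e in (0, 20, 85, 150, 300):
    print(e, rollout_horizon(e, a=20, b=150, x=1, y=15))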