Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Authors: Xiyao Wang, Wichayaporn Wongkamjan, Ruonan Jia, Furong Huang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods."
Researcher Affiliation | Academia | "1 Department of Computer Science, University of Maryland, College Park, MD 20742, USA; 2 Tsinghua University."
Pseudocode | Yes | "Algorithm 1 Policy-adapted Dynamics Model Learning (PDML)"
Open Source Code | No | "We implement PDML-MBPO based on the PyTorch-version MBPO (Liu et al., 2020)." The paper states that the method is built on an existing open-source implementation, but it does not explicitly state that the PDML-MBPO code itself is open source or provide a link to it.
Open Datasets | Yes | "We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments..."
Dataset Splits | No | The paper mentions collecting 'real samples' and using an 'evaluation dataset' of 1000 N samples for error calculation, but it does not specify how the data is split into explicit training, validation, and test sets with percentages or counts for reproducibility.
Hardware Specification | Yes | "All experiments are conducted using a single NVIDIA TITAN X Pascal GPU."
Software Dependencies | Yes | "We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments..."
Experiment Setup | Yes | "We set the current policy proportion to be 0.02 and α equals 0.02/0.98. One thing that needs to be noted is the rollout horizon setting. As introduced in MBPO (Janner et al., 2019), the rollout horizon should start at a short horizon and increase linearly with the interaction epoch. [a, b, x, y] denotes a thresholded linear function, i.e., at epoch e, the rollout horizon is h = min(max(x + ((e − a)/(b − a))(y − x), x), y). We set the rollout horizon to be the same as used in the MBPO paper, as shown in Table 5. Other hyper-parameter settings are shown in Table 6."
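
To make the thresholded linear rollout-horizon schedule quoted above concrete, here is a minimal Python sketch of that formula. The function name and the example schedule values are illustrative assumptions, not taken from Table 5 of the paper.

# Minimal sketch of the thresholded linear rollout-horizon schedule described
# above: a schedule [a, b, x, y] means the horizon grows linearly from x to y
# between epochs a and b, and is clipped to [x, y] outside that range.
def rollout_horizon(epoch, a, b, x, y):
    """Rollout horizon h at the given interaction epoch (hypothetical helper)."""
    h = x + (epoch - a) / (b - a) * (y - x)  # linear interpolation in epoch
    return int(min(max(h, x), y))            # threshold at both ends

# Usage with a placeholder schedule [a, b, x, y] = [20, 150, 1, 15]
# (illustrative values, not the ones reported in Table 5):
for e in (0, 20, 85, 150, 300):
    print(e, rollout_horizon(e, a=20, b=150, x=1, y=15))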