Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy
Authors: Xiyao Wang, Wichayaporn Wongkamjan, Ruonan Jia, Furong Huang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance when combined with state-of-the-art model-based RL methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park, MD 20742, USA 2Tsinghua University. |
| Pseudocode | Yes | Algorithm 1 Policy-adapted Dynamics Model Learning (PDML) |
| Open Source Code | No | We implement PDML-MBPO based on the PyTorch-version MBPO (Liu et al., 2020). - The paper states their method is based on an existing open-source implementation, but does not explicitly state that the code for PDML-MBPO itself is open-source or provide a link to it. |
| Open Datasets | Yes | We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments... |
| Dataset Splits | No | The paper mentions collecting 'real samples' and using an 'evaluation dataset' of 1000 N samples for error calculation, but does not specify how the total data is split into explicit training, validation, and test sets with percentages or counts for reproducibility. |
| Hardware Specification | Yes | All experiments are conducted using a single NVIDIA TITAN X Pascal GPU. |
| Software Dependencies | Yes | We conduct experiments on six complex MuJoCo-v2 (Todorov et al., 2012) environments... |
| Experiment Setup | Yes | We set the current-policy proportion to 0.02 and α to 0.02/0.98. One point that needs attention is the rollout-horizon setting. As introduced in MBPO (Janner et al., 2019), the rollout horizon starts short and increases linearly with the interaction epoch. [a, b, x, y] denotes a thresholded linear schedule, i.e. at epoch e the rollout horizon is h = min(max(x + (e − a) / (b − a) · (y − x), x), y) (see the sketch below the table). We set the rollout horizon to the same values used in the MBPO paper, as shown in Table 5. Other hyper-parameter settings are shown in Table 6. |
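
The thresholded linear rollout-horizon schedule quoted above is straightforward to compute. Below is a minimal Python sketch of that schedule; the function name and the example parameter values are illustrative assumptions, not the authors' code or the specific Table 5 settings.

```python
def rollout_horizon(epoch: int, a: int, b: int, x: int, y: int) -> int:
    """Thresholded linear schedule [a, b, x, y] in the MBPO style.

    Before epoch `a` the horizon stays at `x`; between epochs `a` and `b`
    it grows linearly from `x` to `y`; after epoch `b` it is clipped at `y`.
    """
    h = x + (epoch - a) / (b - a) * (y - x)
    return int(min(max(h, x), y))


# Illustrative usage with made-up schedule parameters [20, 150, 1, 15]:
for e in (0, 20, 85, 150, 300):
    print(f"epoch {e}: horizon {rollout_horizon(e, a=20, b=150, x=1, y=15)}")
```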