Temporal Difference Learning for Model Predictive Control
Authors: Nicklas A Hansen, Hao Su, Xiaolong Wang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on a variety of continuous control tasks from DMControl (Tassa et al., 2018) and Meta-World (Yu et al., 2019), where we find that our method achieves superior sample efficiency and asymptotic performance over prior model-based and model-free methods. In particular, our method solves Humanoid and Dog locomotion tasks with up to 38-dimensional continuous action spaces in as little as 1M environment steps (see Figure 1), and is trivially extended to match the state-of-the-art in image-based RL. |
| Researcher Affiliation | Academia | Nicklas Hansen, Xiaolong Wang*, Hao Su* (UC San Diego). Correspondence to: Nicklas Hansen <nihansen@ucsd.edu>. |
| Pseudocode | Yes | We summarize our framework in Figure 1 and Algorithm 1. Algorithm 1: TD-MPC (inference); Algorithm 2: TOLD (training). Additionally, PyTorch-like pseudo-code for training our TOLD model (codified version of Algorithm 2) is shown below. (A hedged sketch of such a training step is given after this table.) |
| Open Source Code | Yes | Code and videos are available at https://nicklashansen.github.io/td-mpc. |
| Open Datasets | Yes | We evaluate TD-MPC with a TOLD model on a total of 92 diverse and challenging continuous control tasks from DeepMind Control Suite (DMControl; Tassa et al. (2018)) and Meta-World v2 (Yu et al., 2019)... |
| Dataset Splits | No | The paper describes the datasets used and mentions 'training' but does not explicitly provide specific percentages or counts for training, validation, and test splits. |
| Hardware Specification | Yes | Methods are benchmarked on a single RTX3090 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch-like pseudo-code' and refers to specific implementations like 'Yarats & Kostrikov (2020)' (which is a PyTorch SAC implementation) and the 'rliable toolkit provided by Agarwal et al. (2021)'. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Implementation details. All components are deterministic and implemented using MLPs. We linearly anneal the exploration parameter ϵ of Πθ and πθ from 0.5 to 0.05 over the first 25k decision steps. We use a planning horizon of H = 5, and sample trajectories using prioritized experience replay (Schaul et al., 2016) with priority scaled by the value loss. During planning, we plan for 6 iterations (8 for Dog; 12 for Humanoid), sampling N = 512 trajectories (+5% sampled from πθ), and we compute µ, σ parameters over the top-64 trajectories each iteration. For image-based tasks, observations are 3 stacked 84×84-dimensional RGB frames and we use ±4 pixel shift augmentation (Kostrikov et al., 2020). Refer to Appendix F for additional details. Appendix F also includes 'Table 4. TD-MPC hyperparameters', which lists parameters such as discount factor, learning rate, and batch size. (A hedged sketch of the planning loop described here follows this table.) |
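
As noted in the Pseudocode row, the paper trains the TOLD model (Algorithm 2) by rolling its latent dynamics forward over a short horizon and combining latent-consistency, reward-prediction, and value (TD) losses with per-step weights that decay over the horizon. The snippet below is a minimal PyTorch sketch of such a training step, not the authors' implementation: the module names (`enc`, `dyn`, `rew`, `q`, `pi`), the target networks, and the loss coefficients are illustrative assumptions, and the separate policy update (maximizing Q with TOLD gradients stopped) is omitted.

```python
# Hedged sketch of a TOLD-style training step (cf. Algorithm 2 and the paper's
# PyTorch-like pseudo-code). Module names, target networks, and coefficients
# below are illustrative assumptions, not the authors' exact code.
import torch
import torch.nn.functional as F

def told_update(batch, enc, dyn, rew, q, pi, enc_tgt, q_tgt, optim,
                horizon=5, gamma=0.99, rho=0.5,
                c_cons=2.0, c_rew=0.5, c_val=0.1):
    # batch: obs (H+1, B, obs_dim), act (H, B, act_dim), r (H, B, 1)
    obs, act, r = batch
    z = enc(obs[0])                              # encode first observation
    total_loss = 0.0
    for t in range(horizon):
        with torch.no_grad():
            z_next_tgt = enc_tgt(obs[t + 1])     # latent consistency target
            a_next = pi(z_next_tgt)
            td_target = r[t] + gamma * q_tgt(z_next_tgt, a_next)
        r_hat = rew(z, act[t])                   # predicted reward
        q_hat = q(z, act[t])                     # predicted value
        z = dyn(z, act[t])                       # roll latent dynamics forward
        step_loss = (c_cons * F.mse_loss(z, z_next_tgt)
                     + c_rew * F.mse_loss(r_hat, r[t])
                     + c_val * F.mse_loss(q_hat, td_target))
        total_loss = total_loss + (rho ** t) * step_loss  # decaying step weight
    optim.zero_grad()
    total_loss.backward()
    optim.step()
    return total_loss.detach()
```

Note that gradients flow through the multi-step latent rollout (`z` is not re-encoded at each step), which is what ties the dynamics, reward, and value heads together during training.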
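
The planning procedure described in the Experiment Setup row is an iterative, sampling-based optimizer over H-step action sequences: each iteration samples candidates around the current (µ, σ), mixes in a small fraction of rollouts from the learned policy, scores every candidate in the latent model by summed predicted rewards plus a terminal value estimate, and refits (µ, σ) on the top-64 sequences. The sketch below illustrates that loop under stated assumptions; the function names, the softmax-weighted refit and its temperature, and the 25 policy-rollout samples (≈5% of 512) are illustrative choices, not the paper's exact implementation.

```python
# Hedged sketch of the sampling-based planner described above (iterative
# refinement over H-step action sequences with top-k elites). Names and the
# softmax-weighted refit are assumptions for illustration.
import torch

@torch.no_grad()
def plan(obs, enc, dyn, rew, q, pi, act_dim, horizon=5, iters=6,
         n_samples=512, n_pi=25, k=64, gamma=0.99, temperature=0.5):
    z0 = enc(obs.unsqueeze(0))                         # (1, latent_dim)
    mu = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        # Sample candidate action sequences around (mu, std) ...
        acts = (mu.unsqueeze(1) + std.unsqueeze(1)
                * torch.randn(horizon, n_samples, act_dim)).clamp(-1, 1)
        # ... plus a small fraction rolled out with the learned policy pi.
        pi_acts, z_pi = [], z0.expand(n_pi, -1)
        for t in range(horizon):
            a = pi(z_pi)
            pi_acts.append(a)
            z_pi = dyn(z_pi, a)
        acts = torch.cat([acts, torch.stack(pi_acts)], dim=1)  # (H, N+n_pi, A)

        # Score each sequence in the latent model: discounted predicted
        # rewards plus a terminal value estimate.
        z = z0.expand(acts.shape[1], -1)
        ret = torch.zeros(acts.shape[1], 1)
        discount = 1.0
        for t in range(horizon):
            ret = ret + discount * rew(z, acts[t])
            z = dyn(z, acts[t])
            discount *= gamma
        ret = ret + discount * q(z, pi(z))

        # Refit (mu, std) over the top-k sequences (softmax-weighted).
        elite_idx = ret.squeeze(1).topk(k).indices
        elite_acts, elite_ret = acts[:, elite_idx], ret[elite_idx]
        w = torch.softmax(elite_ret / temperature, dim=0)       # (k, 1)
        mu = (w.unsqueeze(0) * elite_acts).sum(dim=1)
        std = ((w.unsqueeze(0) * (elite_acts - mu.unsqueeze(1)) ** 2)
               .sum(dim=1)).sqrt()
    return mu[0]                                                # first action
```

During training, the paper additionally perturbs the executed action with exploration noise ϵ, annealed linearly from 0.5 to 0.05 over the first 25k decision steps; that perturbation is omitted from the sketch.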