Temporal Difference Learning for Model Predictive Control

Authors: Nicklas A Hansen, Hao Su, Xiaolong Wang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on a variety of continuous control tasks from DMControl (Tassa et al., 2018) and Meta-World (Yu et al., 2019), where we find that our method achieves superior sample efficiency and asymptotic performance over prior model-based and model-free methods. In particular, our method solves Humanoid and Dog locomotion tasks with up to 38-dimensional continuous action spaces in as little as 1M environment steps (see Figure 1), and is trivially extended to match the state-of-the-art in image-based RL.
Researcher Affiliation | Academia | Nicklas Hansen, Xiaolong Wang, Hao Su (UC San Diego). Correspondence to: Nicklas Hansen <nihansen@ucsd.edu>.
Pseudocode | Yes | We summarize our framework in Figure 1 and Algorithm 1 (Algorithm 1: TD-MPC (inference); Algorithm 2: TOLD (training)). Additionally, PyTorch-like pseudo-code for training our TOLD model (a codified version of Algorithm 2) is included in the paper; an illustrative sketch of such a training update is given after this table.
Open Source Code | Yes | Code and videos are available at https://nicklashansen.github.io/td-mpc.
Open Datasets | Yes | We evaluate TD-MPC with a TOLD model on a total of 92 diverse and challenging continuous control tasks from DeepMind Control Suite (DMControl; Tassa et al. (2018)) and Meta-World v2 (Yu et al., 2019)...
Dataset Splits | No | The paper describes the datasets used and mentions 'training', but does not explicitly provide percentages or counts for training, validation, and test splits.
Hardware Specification | Yes | Methods are benchmarked on a single RTX3090 GPU.
Software Dependencies | No | The paper mentions 'PyTorch-like pseudo-code' and refers to specific implementations such as 'Yarats & Kostrikov (2020)' (a PyTorch SAC implementation) and the 'rliable toolkit provided by Agarwal et al. (2021)', but it does not provide version numbers for these software components.
Experiment Setup | Yes | Implementation details. All components are deterministic and implemented using MLPs. We linearly anneal the exploration parameter ϵ of Πθ and πθ from 0.5 to 0.05 over the first 25k decision steps. We use a planning horizon of H = 5, and sample trajectories using prioritized experience replay (Schaul et al., 2016) with priority scaled by the value loss. During planning, we plan for 6 iterations (8 for Dog; 12 for Humanoid), sampling N = 512 trajectories (+5% sampled from πθ), and we compute µ, σ parameters over the top-64 trajectories each iteration (see the planning sketch after this table). For image-based tasks, observations are 3 stacked 84×84 RGB frames and we use ±4 pixel shift augmentation (Kostrikov et al., 2020). Refer to Appendix F for additional details. Appendix F also includes 'Table 4. TD-MPC hyperparameters', which lists parameters such as the discount factor, learning rate, and batch size.
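The Pseudocode row above refers to PyTorch-like pseudo-code for training the TOLD model (Algorithm 2). As a rough guide to what such an update looks like, here is a minimal sketch, assuming simple MLP components; the loss coefficients (c1, c2, c3, rho) and tensor shapes are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch (not the authors' released code) of a TOLD-style training step.
# Network definitions, coefficients (c1, c2, c3, rho) and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class TOLD(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(obs_dim, latent_dim)               # encoder  z = h(s)
        self.d = mlp(latent_dim + act_dim, latent_dim)  # latent dynamics
        self.R = mlp(latent_dim + act_dim, 1)           # reward predictor
        self.Q = mlp(latent_dim + act_dim, 1)           # state-action value
        self.pi = mlp(latent_dim, act_dim)              # policy (deterministic)

def told_update(model, model_target, optim, batch,
                horizon=5, gamma=0.99, rho=0.5, c1=0.5, c2=0.1, c3=2.0):
    """One gradient step on a batch of trajectory segments.

    batch: obs (H+1, B, obs_dim), act (H, B, act_dim), rew (H, B, 1)
    """
    obs, act, rew = batch
    z = model.h(obs[0])                                  # initial latent state
    total_loss, discount = 0.0, 1.0
    for t in range(horizon):
        za = torch.cat([z, act[t]], dim=-1)
        reward_pred = model.R(za)                        # predicted reward
        q_pred = model.Q(za)                             # predicted value
        z_next = model.d(za)                             # latent rollout
        with torch.no_grad():                            # targets from EMA network
            z_next_target = model_target.h(obs[t + 1])
            a_next = model_target.pi(z_next_target)
            q_next = model_target.Q(torch.cat([z_next_target, a_next], dim=-1))
            td_target = rew[t] + gamma * q_next
        # Latent consistency, reward and TD losses, discounted by rho^t.
        loss = (c1 * F.mse_loss(z_next, z_next_target)
                + c2 * F.mse_loss(reward_pred, rew[t])
                + c3 * F.mse_loss(q_pred, td_target))
        total_loss = total_loss + discount * loss
        discount *= rho
        z = z_next
    optim.zero_grad()
    total_loss.backward()
    optim.step()
    return total_loss.item()
```

In the paper, the policy πθ is additionally updated in a separate step to maximize Qθ, and the target parameters are maintained as an exponential moving average of θ; both are omitted above for brevity.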
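The Experiment Setup row describes the sampling-based planner (horizon H = 5, 6 iterations, N = 512 sampled trajectories, µ, σ refit over the top-64 each iteration). Below is a simplified elite-based sketch of such a loop under the same assumptions as the model sketch above; the actual TD-MPC planner uses MPPI-style weighted updates and also mixes in trajectories sampled from πθ, which are simplified away here.

```python
# Simplified sketch of the planning loop described in the Experiment Setup row
# (H=5, 6 iterations, N=512 candidate action sequences, top-64 elites).
# Elites are combined with a plain mean/std update for brevity.
import torch

@torch.no_grad()
def plan(model, obs, act_dim, horizon=5, iterations=6, num_samples=512,
         num_elites=64, gamma=0.99):
    z0 = model.h(obs.unsqueeze(0))                       # (1, latent_dim)
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iterations):
        # Sample N candidate action sequences around the current distribution.
        actions = (mean.unsqueeze(1) +
                   std.unsqueeze(1) * torch.randn(horizon, num_samples, act_dim))
        actions = actions.clamp(-1, 1)
        # Score each sequence by rolling out the learned latent model:
        # discounted sum of predicted rewards plus a terminal value estimate.
        z = z0.expand(num_samples, -1)
        returns, discount = torch.zeros(num_samples, 1), 1.0
        for t in range(horizon):
            za = torch.cat([z, actions[t]], dim=-1)
            returns += discount * model.R(za)
            z = model.d(za)
            discount *= gamma
        returns += discount * model.Q(torch.cat([z, model.pi(z)], dim=-1))
        # Refit mean/std over the top-k (elite) action sequences.
        elite_idx = returns.squeeze(-1).topk(num_elites).indices
        elites = actions[:, elite_idx]                   # (H, k, act_dim)
        mean, std = elites.mean(dim=1), elites.std(dim=1) + 1e-6
    return mean[0]                                       # first action of the plan
```

Scoring candidate sequences as a short-horizon reward sum plus a terminal value from the learned Q-function is what allows a planning horizon as short as H = 5; only the first action of the refined plan is executed before replanning.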