Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Temporal Difference Learning for Model Predictive Control
Authors: Nicklas A Hansen, Hao Su, Xiaolong Wang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on a variety of continuous control tasks from DMControl (Tassa et al., 2018) and Meta-World (Yu et al., 2019), where we find that our method achieves superior sample efficiency and asymptotic performance over prior model-based and model-free methods. In particular, our method solves Humanoid and Dog locomotion tasks with up to 38-dimensional continuous action spaces in as little as 1M environment steps (see Figure 1), and is trivially extended to match the state-of-the-art in image-based RL. |
| Researcher Affiliation | Academia | Nicklas Hansen¹, Xiaolong Wang*¹, Hao Su*¹. ¹UC San Diego. Correspondence to: Nicklas Hansen <EMAIL>. |
| Pseudocode | Yes | We summarize our framework in Figure 1 and Algorithm 1. Algorithm 1: TD-MPC (inference). Algorithm 2: TOLD (training). Additionally, PyTorch-like pseudo-code for training our TOLD model (codified version of Algorithm 2) is shown below: |
| Open Source Code | Yes | Code and videos are available at https://nicklashansen.github.io/td-mpc. |
| Open Datasets | Yes | We evaluate TD-MPC with a TOLD model on a total of 92 diverse and challenging continuous control tasks from DeepMind Control Suite (DMControl; Tassa et al. (2018)) and Meta-World v2 (Yu et al., 2019)... |
| Dataset Splits | No | The paper describes the datasets used and mentions 'training' but does not explicitly provide specific percentages or counts for training, validation, and test splits. |
| Hardware Specification | Yes | Methods are benchmarked on a single RTX3090 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch-like pseudo-code' and refers to specific implementations like 'Yarats & Kostrikov (2020)' (which is a PyTorch SAC implementation) and the 'rliable toolkit provided by Agarwal et al. (2021)'. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Implementation details. All components are deterministic and implemented using MLPs. We linearly anneal the exploration parameter ϵ of Πθ and πθ from 0.5 to 0.05 over the first 25k decision steps. We use a planning horizon of H = 5, and sample trajectories using prioritized experience replay (Schaul et al., 2016) with priority scaled by the value loss. During planning, we plan for 6 iterations (8 for Dog; 12 for Humanoid), sampling N = 512 trajectories (+5% sampled from πθ), and we compute µ, σ parameters over the top-64 trajectories each iteration. For image-based tasks, observations are 3 stacked 84×84-dimensional RGB frames and we use 4 pixel shift augmentation (Kostrikov et al., 2020). Refer to Appendix F for additional details. Appendix F also includes 'Table 4. TD-MPC hyperparameters' which lists various parameters such as Discount factor, Learning rate, Batch size, etc. |
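
The annealing schedule and sampling-based planning loop quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain unweighted elite update (cross-entropy style), omits the extra trajectories sampled from πθ, and the names `linear_schedule`, `plan`, and `score_fn`, the initial standard deviation, and the [-1, 1] action bound are all hypothetical. In the paper, trajectories are scored with the learned TOLD model; here `score_fn` is a stand-in for any function that returns one score per candidate action sequence.

```python
import numpy as np


def linear_schedule(step, start=0.5, end=0.05, duration=25_000):
    """Linearly interpolate from `start` to `end` over `duration` steps, then hold."""
    frac = min(max(step / duration, 0.0), 1.0)
    return start + frac * (end - start)


def plan(score_fn, action_dim, horizon=5, iterations=6,
         num_samples=512, num_elites=64, init_std=1.0, seed=0):
    """Sampling-based planner with an unweighted elite update (an assumption).

    score_fn maps an array of shape (num_samples, horizon, action_dim)
    to an array of shape (num_samples,) of estimated returns.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), init_std)

    for _ in range(iterations):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        # Assumption: actions are bounded in [-1, 1].
        actions = np.clip(mean + std * noise, -1.0, 1.0)

        # Keep the highest-scoring sequences.
        returns = score_fn(actions)
        elite_idx = np.argsort(returns)[-num_elites:]
        elites = actions[elite_idx]

        # Refit the sampling distribution to the elites.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids a collapsed std

    return mean


def toy_score(actions, target=0.5):
    """Hypothetical stand-in objective: prefer actions close to `target`."""
    return -((actions - target) ** 2).sum(axis=(1, 2))


if __name__ == "__main__":
    print(linear_schedule(0), linear_schedule(12_500), linear_schedule(50_000))
    best = plan(toy_score, action_dim=2)
    print(best.shape)
```

The defaults mirror the values quoted in the table (H = 5, 6 iterations, N = 512, top-64). Per the same row, Dog and Humanoid tasks would use `iterations=8` and `iterations=12` respectively.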