Temporal Difference Learning for Model Predictive Control

Authors: Nicklas A Hansen, Hao Su, Xiaolong Wang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on a variety of continuous control tasks from DMControl (Tassa et al., 2018) and Meta-World (Yu et al., 2019), where we find that our method achieves superior sample efficiency and asymptotic performance over prior model-based and model-free methods. In particular, our method solves Humanoid and Dog locomotion tasks with up to 38-dimensional continuous action spaces in as little as 1M environment steps (see Figure 1), and is trivially extended to match the state-of-the-art in image-based RL.
Researcher Affiliation | Academia | Nicklas Hansen, Xiaolong Wang, Hao Su (UC San Diego). Correspondence to: Nicklas Hansen <nihansen@ucsd.edu>.
Pseudocode | Yes | We summarize our framework in Figure 1 and Algorithm 1 (Algorithm 1: TD-MPC (inference); Algorithm 2: TOLD (training)). Additionally, PyTorch-like pseudo-code for training our TOLD model (a codified version of Algorithm 2) is included in the paper; an illustrative sketch of such a training update is given after this table.
Open Source Code | Yes | Code and videos are available at https://nicklashansen.github.io/td-mpc.
Open Datasets | Yes | We evaluate TD-MPC with a TOLD model on a total of 92 diverse and challenging continuous control tasks from DeepMind Control Suite (DMControl; Tassa et al. (2018)) and Meta-World v2 (Yu et al., 2019)...
Dataset Splits | No | The paper describes the datasets used and mentions 'training', but does not explicitly provide percentages or counts for training, validation, and test splits.
Hardware Specification | Yes | Methods are benchmarked on a single RTX3090 GPU.
Software Dependencies | No | The paper mentions 'PyTorch-like pseudo-code' and refers to specific implementations such as 'Yarats & Kostrikov (2020)' (a PyTorch SAC implementation) and the 'rliable toolkit provided by Agarwal et al. (2021)', but it does not provide version numbers for these software components.
Experiment Setup | Yes | Implementation details. All components are deterministic and implemented using MLPs. We linearly anneal the exploration parameter ϵ of Πθ and πθ from 0.5 to 0.05 over the first 25k decision steps. We use a planning horizon of H = 5, and sample trajectories using prioritized experience replay (Schaul et al., 2016) with priority scaled by the value loss. During planning, we plan for 6 iterations (8 for Dog; 12 for Humanoid), sampling N = 512 trajectories (+5% sampled from πθ), and we compute µ, σ parameters over the top-64 trajectories each iteration (see the planning sketch after this table). For image-based tasks, observations are 3 stacked 84×84 RGB frames and we use ±4 pixel shift augmentation (Kostrikov et al., 2020). Refer to Appendix F for additional details. Appendix F also includes 'Table 4. TD-MPC hyperparameters', which lists parameters such as the discount factor, learning rate, and batch size.
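The Pseudocode row above refers to PyTorch-like pseudo-code for training the TOLD model (Algorithm 2). As a rough guide to what such an update looks like, here is a minimal sketch, assuming simple MLP components; the loss coefficients (c1, c2, c3, rho) and tensor shapes are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch (not the authors' released code) of a TOLD-style training step.
# Network definitions, coefficients (c1, c2, c3, rho) and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class TOLD(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(obs_dim, latent_dim)               # encoder  z = h(s)
        self.d = mlp(latent_dim + act_dim, latent_dim)  # latent dynamics
        self.R = mlp(latent_dim + act_dim, 1)           # reward predictor
        self.Q = mlp(latent_dim + act_dim, 1)           # state-action value
        self.pi = mlp(latent_dim, act_dim)              # policy (deterministic)

def told_update(model, model_target, optim, batch,
                horizon=5, gamma=0.99, rho=0.5, c1=0.5, c2=0.1, c3=2.0):
    """One gradient step on a batch of trajectory segments.

    batch: obs (H+1, B, obs_dim), act (H, B, act_dim), rew (H, B, 1)
    """
    obs, act, rew = batch
    z = model.h(obs[0])                                  # initial latent state
    total_loss, discount = 0.0, 1.0
    for t in range(horizon):
        za = torch.cat([z, act[t]], dim=-1)
        reward_pred = model.R(za)                        # predicted reward
        q_pred = model.Q(za)                             # predicted value
        z_next = model.d(za)                             # latent rollout
        with torch.no_grad():                            # targets from EMA network
            z_next_target = model_target.h(obs[t + 1])
            a_next = model_target.pi(z_next_target)
            q_next = model_target.Q(torch.cat([z_next_target, a_next], dim=-1))
            td_target = rew[t] + gamma * q_next
        # Latent consistency, reward and TD losses, discounted by rho^t.
        loss = (c1 * F.mse_loss(z_next, z_next_target)
                + c2 * F.mse_loss(reward_pred, rew[t])
                + c3 * F.mse_loss(q_pred, td_target))
        total_loss = total_loss + discount * loss
        discount *= rho
        z = z_next
    optim.zero_grad()
    total_loss.backward()
    optim.step()
    return total_loss.item()
```

In the paper, the policy πθ is additionally updated in a separate step to maximize Qθ, and the target parameters are maintained as an exponential moving average of θ; both are omitted above for brevity.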
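The Experiment Setup row describes the sampling-based planner (horizon H = 5, 6 iterations, N = 512 sampled trajectories, µ, σ refit over the top-64 each iteration). Below is a simplified elite-based sketch of such a loop under the same assumptions as the model sketch above; the actual TD-MPC planner uses MPPI-style weighted updates and also mixes in trajectories sampled from πθ, which are simplified away here.

```python
# Simplified sketch of the planning loop described in the Experiment Setup row
# (H=5, 6 iterations, N=512 candidate action sequences, top-64 elites).
# Elites are combined with a plain mean/std update for brevity.
import torch

@torch.no_grad()
def plan(model, obs, act_dim, horizon=5, iterations=6, num_samples=512,
         num_elites=64, gamma=0.99):
    z0 = model.h(obs.unsqueeze(0))                       # (1, latent_dim)
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iterations):
        # Sample N candidate action sequences around the current distribution.
        actions = (mean.unsqueeze(1) +
                   std.unsqueeze(1) * torch.randn(horizon, num_samples, act_dim))
        actions = actions.clamp(-1, 1)
        # Score each sequence by rolling out the learned latent model:
        # discounted sum of predicted rewards plus a terminal value estimate.
        z = z0.expand(num_samples, -1)
        returns, discount = torch.zeros(num_samples, 1), 1.0
        for t in range(horizon):
            za = torch.cat([z, actions[t]], dim=-1)
            returns += discount * model.R(za)
            z = model.d(za)
            discount *= gamma
        returns += discount * model.Q(torch.cat([z, model.pi(z)], dim=-1))
        # Refit mean/std over the top-k (elite) action sequences.
        elite_idx = returns.squeeze(-1).topk(num_elites).indices
        elites = actions[:, elite_idx]                   # (H, k, act_dim)
        mean, std = elites.mean(dim=1), elites.std(dim=1) + 1e-6
    return mean[0]                                       # first action of the plan
```

Scoring candidate sequences as a short-horizon reward sum plus a terminal value from the learned Q-function is what allows a planning horizon as short as H = 5; only the first action of the refined plan is executed before replanning.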