Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Authors: Vitchyr Pong*, Shixiang Gu*, Murtaza Dalal, Sergey Levine

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods." "Our empirical experiments demonstrate that this method achieves substantially better sample complexity than fully model-free learning on a range of challenging continuous control tasks, while outperforming purely model-based methods in terms of final performance." "Our experiments examine how the sample efficiency and performance of TDMs compare to both model-based and model-free RL algorithms."
Researcher Affiliation | Collaboration | Vitchyr Pong, University of California, Berkeley (vitchyr@berkeley.edu); Shixiang Gu, University of Cambridge / Max Planck Institute / Google Brain (sg717@cam.ac.uk); Murtaza Dalal, University of California, Berkeley (mdalal@berkeley.edu); Sergey Levine, University of California, Berkeley (svlevine@eecs.berkeley.edu)
Pseudocode | Yes | The algorithm is summarized as Algorithm 1.
Open Source Code | No | The paper neither states that its source code is released nor provides a link to a code repository for the described method.
Open Datasets | No | The paper uses the MuJoCo physics simulator (Todorov et al., 2012) and OpenAI Gym environments (Brockman et al., 2016) for its simulated tasks. While these are openly available, the paper does not specify or provide access to a predefined public dataset with train/validation/test splits, as is common in supervised learning.
Dataset Splits | No | The paper does not specify exact train/validation/test splits for any dataset. It describes hyperparameter tuning and performance evaluation on experience collected during reinforcement learning, not a traditional dataset split.
Hardware Specification | No | The paper mentions a 7-DoF Sawyer robotic arm for the real-world experiments, but it does not provide any details about the computing hardware (e.g., CPU/GPU models, memory) used to run the training and experiments.
Software Dependencies | No | The paper mentions DDPG (Lillicrap et al., 2015) as the base learning algorithm, Adam (Kingma & Ba, 2014) as the optimizer, and MuJoCo and OpenAI Gym as simulation environments, but it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | The experience replay buffer (Mnih et al., 2015) has a size of 1 million transitions, and soft target networks (Lillicrap et al., 2015) are used with a Polyak averaging coefficient of 0.999 for DDPG and TDM and 0.95 for HER and DDPG-Sparse. The learning rates of the critic and the actor are each chosen from {1e-4, 1e-3}. Adam (Kingma & Ba, 2014) is used as the base optimizer with default parameters except the learning rate. The batch size was 128. The policies and critics are parameterized by neural networks with ReLU hidden activations and two hidden layers of size 300 and 300. For TDMs, we found the most important hyperparameters to be the reward scale, τmax, and the number of updates per observation, I. For all the model-free algorithms (DDPG, DDPG-Sparse, HER, and TDMs), we performed a grid search over the reward scale in the range {0.01, 1, 100, 10000} and the number of updates per observation in the range {1, 5, 10}. For TDMs, we also tuned τmax over {15, 25, Horizon - 1}. For the half cheetah task, we performed extra searches over τmax and found τmax = 9 to be effective.
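
To make the reported setup concrete, below is a minimal sketch of the described networks, optimizer, and soft target update, assuming PyTorch. The dimensions and helper names (obs_dim, act_dim, make_mlp, soft_update) are illustrative placeholders, not taken from the paper.

    import torch
    import torch.nn as nn

    obs_dim, act_dim = 17, 6  # placeholder dimensions; these depend on the task

    def make_mlp(in_dim, out_dim):
        # "two hidden layers of size 300 and 300" with ReLU hidden activations
        return nn.Sequential(
            nn.Linear(in_dim, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, out_dim),
        )

    actor = make_mlp(obs_dim, act_dim)
    critic = make_mlp(obs_dim + act_dim, 1)  # a TDM critic would also take the goal and horizon tau as inputs
    target_actor = make_mlp(obs_dim, act_dim)
    target_critic = make_mlp(obs_dim + act_dim, 1)

    # Adam with default parameters except the learning rate, which is searched over {1e-4, 1e-3}
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    REPLAY_SIZE = 1_000_000  # experience replay of 1 million transitions
    BATCH_SIZE = 128
    POLYAK = 0.999           # 0.999 for DDPG and TDM, 0.95 for HER and DDPG-Sparse

    def soft_update(target, source, polyak=POLYAK):
        # Polyak-averaged soft target network update
        with torch.no_grad():
            for t, s in zip(target.parameters(), source.parameters()):
                t.mul_(polyak).add_((1.0 - polyak) * s)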
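
The hyperparameter grid described above can likewise be written down directly. Here run_experiment is a hypothetical stand-in for one training and evaluation run, and HORIZON is an assumed episode length; neither is provided by the paper.

    import itertools

    HORIZON = 25                            # placeholder; the episode horizon is task-dependent
    REWARD_SCALES = [0.01, 1, 100, 10000]   # searched for all model-free algorithms
    UPDATES_PER_OBS = [1, 5, 10]            # gradient updates per environment observation, I
    TAU_MAXES = [15, 25, HORIZON - 1]       # searched for TDMs only (tau_max = 9 for half cheetah)

    def run_experiment(algo, reward_scale, updates_per_obs, tau_max=None):
        # Hypothetical stand-in: train and evaluate one hyperparameter configuration.
        ...

    # DDPG, DDPG-Sparse, and HER: reward scale x updates per observation
    for scale, updates in itertools.product(REWARD_SCALES, UPDATES_PER_OBS):
        run_experiment("ddpg", scale, updates)

    # TDMs additionally sweep tau_max
    for scale, updates, tau_max in itertools.product(REWARD_SCALES, UPDATES_PER_OBS, TAU_MAXES):
        run_experiment("tdm", scale, updates, tau_max)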