Evaluating Model-Based Planning and Planner Amortization for Continuous Control
Authors: Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We find that well-tuned model-free agents are strong baselines even for high-DoF control problems, but MPC with learned proposal distributions and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency in hard multi-task/multi-goal settings. |
| Researcher Affiliation | Collaboration | DeepMind; Max Planck Institute for Intelligent Systems, Tübingen, Germany, and Computational and Biological Learning Group, University of Cambridge; Facebook Reality Labs. (Equal contributions; correspondence to {abyravan, leonardh}@google.com; work done at DeepMind.) |
| Pseudocode | Yes | Algorithm 1: Agent combining MPC with model-free RL (a hedged sketch of this loop is given after the table). |
| Open Source Code | No | The paper mentions 'Videos of agents performing different tasks can be seen on our website.' but does not explicitly state that the source code for the methodology described in the paper is openly available or provide a direct link to a code repository. |
| Open Datasets | Yes | In this paper we consider a number of locomotion tasks of varying complexity, simulated with the MuJoCo (Todorov et al., 2012) physics simulator. We consider two embodiments: an 8-DoF ant from dm_control (Tunyasuvunakool et al., 2020) and a model of the Robotis OP3 robot with 20 degrees of freedom. [...] For the OP3, we use poses from the CMU mocap database (cmu) (retargeted to the robot). [...] The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. |
| Dataset Splits | No | The paper describes using a replay buffer for sampling and states batch sizes and sequence lengths ('We sample a batch of trajectories... where T is the trajectory length (set to 10 in all our experiments). This batch of trajectories is also used for policy and critic learning.'), but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions a 'distributed set-up with 64 actors' but does not provide specific details on the CPU, GPU, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'MuJoCo physics simulator,' 'dm_control,' the 'reverb framework,' and the 'Adam optimizer,' but it does not provide specific version numbers for these software components to ensure reproducibility. |
| Experiment Setup | Yes | We tune the BC objective weight β per task. [...] We choose p_plan = 0.5 for all our MPC experiments... and use 250 samples and a planning horizon of 10 for SMC... We tuned each algorithmic variant independently per task, running sweeps over MPO hyperparameters and, where applicable, the BC objective weight β and planner temperature τ. [...] We used ϵ_Σ = 10^-5 and swept over 3 values from ϵ_µ ∈ {5·10^-4, 10^-4, 5·10^-3, 10^-2} depending on the task. For algorithmic variants involving BC we ran hyperparameter sweeps varying the BC objective weight β ∈ {0.001, 0.01, 0.1, 1.0}. For algorithmic variants involving MPC we ran sweeps over the planner temperature τ ∈ {0.001, 0.01, 0.1}... |
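
To make the Algorithm 1 row concrete, below is a minimal Python/NumPy sketch of the planner-as-proposal loop described in the abstract and experiment-setup rows: with probability p_plan the agent plans with MPC, sampling action sequences from the learned policy (the proposal), rolling them out through the learned model, and scoring rollouts by summed reward plus a critic bootstrap. This is not the authors' implementation: `model_step`, `policy_sample`, and `critic_value` are hypothetical stand-ins for the learned model, proposal policy, and critic, and the exponentiated-return reweighting is a simplified substitute for the paper's SMC/CEM planners. Only the constants (250 samples, planning horizon 10, p_plan = 0.5, temperature τ) are taken from the table above.

```python
import numpy as np

def mpc_action(state, model_step, policy_sample, critic_value,
               num_samples=250, horizon=10, temperature=0.01):
    """Sample action sequences from the proposal policy, roll them out
    through the learned model, score each rollout by summed reward plus a
    critic bootstrap, and return a temperature-weighted average of the
    first actions (a simplified stand-in for the paper's SMC/CEM planners)."""
    returns = np.zeros(num_samples)
    first_actions = None
    for i in range(num_samples):
        s = state
        for t in range(horizon):
            a = policy_sample(s)              # learned policy as proposal
            if t == 0:
                if first_actions is None:
                    first_actions = np.zeros((num_samples,) + np.shape(a))
                first_actions[i] = a
            s, r = model_step(s, a)           # learned dynamics/reward model
            returns[i] += r
        returns[i] += critic_value(s)         # bootstrap with the learned critic
    w = np.exp((returns - returns.max()) / temperature)
    w /= w.sum()
    return (w[:, None] * first_actions).sum(axis=0)

def act(state, rng, model_step, policy_sample, critic_value, p_plan=0.5):
    """Agent step (cf. Algorithm 1): with probability p_plan act via MPC,
    otherwise act directly with the model-free proposal policy."""
    if rng.uniform() < p_plan:
        return mpc_action(state, model_step, policy_sample, critic_value)
    return policy_sample(state)

# Toy stubs so the sketch runs end to end; these are NOT the paper's
# environment, model, policy, or critic.
rng = np.random.default_rng(0)
model_step = lambda s, a: (s + 0.01 * a, -float(np.square(s).sum()))
policy_sample = lambda s: rng.normal(size=np.shape(s))
critic_value = lambda s: -float(np.square(s).sum())
print(act(np.zeros(8), rng, model_step, policy_sample, critic_value))
```

In the paper's setting the same policy and critic are also trained model-free (MPO-style), so the planner only has to refine an already reasonable proposal rather than search from scratch; the sketch keeps that division of labor but omits all learning.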