Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

Authors: Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical evaluation, we wish to answer the following questions: 1. Does trajectory optimization in conjunction with uncertainty estimation in value function approximation result in temporally coordinated exploration strategies? 2. Can the use of an approximate value function help reduce the planning horizon for MPC? 3. Does trajectory optimization enable faster and more stable value function learning?
Researcher Affiliation | Collaboration | (1) University of Washington, (2) Roboti LLC, (3) OpenAI
Pseudocode | Yes | The overall procedure is summarized in Algorithm 1, "Plan Online and Learn Offline (POLO)"; a hedged sketch of this loop appears after the table.
Open Source Code | No | The paper provides a link for video demonstrations: "Video demonstration of our results can be found at: https://sites.google.com/view/polo-mpc." (Figure 1), but it does not provide a link to the source code or explicitly state its availability.
Open Datasets | Yes | "The model used for the humanoid experiments was originally distributed with the MuJoCo (Todorov et al., 2012) software package and modified for our use. ... We use the Adroit hand model (Kumar, 2016) and build on top of the hand manipulation task suite of Rajeswaran et al. (2018)."
Dataset Splits | No | The paper describes training procedures and hyperparameters but does not specify train/validation/test dataset splits for its experiments. It describes data collection via agent experience, but not how that data is partitioned into distinct sets for validation.
Hardware Specification | No | The paper states "POLO requires only 12 CPU core hours and 96 seconds of agent experience" (Section 4), which reports CPU time, but it does not specify particular CPU models, GPU models, memory, or other hardware components used for its experiments.
Software Dependencies | No | The paper mentions using "MuJoCo (Todorov et al., 2012)" (Appendix A) and the "MPPI algorithm (Williams et al., 2016)" (Section 3.2), but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For value function approximation in POLO for the humanoid tasks, we use an ensemble of 6 neural networks, each of which has 2 layers with 16 hidden parameters each; tanh is used for non-linearity. Training is performed with 64 gradient steps on minibatches of size 32, using ADAM with default parameters, every 16 timesteps the agent experiences." Reported hyperparameters: discount factor γ = 0.99, planning horizon length = 64, MPPI rollouts = 120, MPPI noise σ = 0.2, MPPI temperature = 1.25. A hedged configuration sketch also follows the table.
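
For a concrete picture of Algorithm 1, the following is a minimal Python sketch of the POLO loop under illustrative assumptions: a toy 1-D double-integrator model stands in for MuJoCo, linear value functions on hand-built features stand in for the paper's neural-network ensemble, and Monte Carlo targets simplify the paper's bootstrapped targets. Names such as mppi_plan, ensemble_value, and fit_ensemble are placeholders, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ENSEMBLE, KAPPA, GAMMA = 6, 5.0, 0.99

# Toy known dynamics: POLO assumes a model is available for online planning.
def step(state, action, dt=0.05):
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    reward = -(pos ** 2) - 0.1 * (action ** 2)
    return np.array([pos, vel]), reward

# Value ensemble: here, linear value functions on quadratic features.
def features(state):
    pos, vel = state
    return np.array([1.0, pos, vel, pos ** 2, vel ** 2, pos * vel])

weights = [rng.normal(scale=0.1, size=6) for _ in range(N_ENSEMBLE)]

def ensemble_value(state):
    # Optimistic aggregation of ensemble members via a stable log-mean-exp;
    # disagreement between members raises the aggregated value.
    vals = np.array([w @ features(state) for w in weights])
    m = vals.max()
    return m + np.log(np.mean(np.exp(KAPPA * (vals - m)))) / KAPPA

# MPPI-style planner that uses the ensemble value to close the horizon.
def mppi_plan(state, nominal, horizon=16, n_rollouts=32, sigma=0.5, temp=1.0):
    noise = rng.normal(scale=sigma, size=(n_rollouts, horizon))
    actions = nominal[None, :] + noise
    scores = np.zeros(n_rollouts)
    for k in range(n_rollouts):
        s, ret, disc = state.copy(), 0.0, 1.0
        for t in range(horizon):
            s, r = step(s, actions[k, t])
            ret += disc * r
            disc *= GAMMA
        scores[k] = ret + disc * ensemble_value(s)  # terminal value estimate
    w = np.exp((scores - scores.max()) / temp)
    w /= w.sum()
    return (w[:, None] * actions).sum(axis=0)       # reward-weighted action plan

# Offline value update: each member fits a bootstrap resample of
# (state, Monte Carlo return) pairs by ridge regression.  The paper instead
# uses n-step targets bootstrapped with the ensemble value.
def fit_ensemble(states, returns):
    X = np.stack([features(s) for s in states])
    y = np.array(returns)
    for i in range(N_ENSEMBLE):
        idx = rng.integers(0, len(y), size=len(y))
        A = X[idx].T @ X[idx] + 1e-2 * np.eye(X.shape[1])
        weights[i] = np.linalg.solve(A, X[idx].T @ y[idx])

# The POLO loop: plan online with MPC, act, store experience, learn offline.
state, nominal = np.array([2.0, 0.0]), np.zeros(16)
buf_states, buf_rewards = [], []
for t in range(200):
    nominal = mppi_plan(state, nominal)
    buf_states.append(state.copy())
    state, reward = step(state, nominal[0])
    buf_rewards.append(reward)
    nominal = np.append(nominal[1:], 0.0)           # receding-horizon shift
    if (t + 1) % 50 == 0:                           # periodic offline update
        g, returns = 0.0, []
        for r in reversed(buf_rewards):
            g = r + GAMMA * g
            returns.append(g)
        fit_ensemble(buf_states, list(reversed(returns)))
        print(f"t={t+1:3d}  pos={state[0]:+.3f}  V([2,0])={ensemble_value(np.array([2.0, 0.0])):+.2f}")
```

The sketch keeps the feature the paper attributes its exploration behavior to, namely the optimistic aggregation of the value ensemble (disagreement between members raises the terminal value seen by the planner), but it does not reproduce the paper's target construction or ensemble regularization.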
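
The experiment-setup row can also be read as a configuration. Below is one possible PyTorch rendering of the quoted value-function setup and the reported planner hyperparameters; the paper does not name its deep-learning framework, and identifiers such as PLANNER_CONFIG, ValueEnsemble, and update_ensemble are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Planner hyperparameters as reported for the humanoid tasks.
PLANNER_CONFIG = {
    "discount_gamma": 0.99,
    "planning_horizon": 64,
    "mppi_rollouts": 120,
    "mppi_noise_sigma": 0.2,
    "mppi_temperature": 1.25,
}

# Value-function training setup as quoted above.
VALUE_CONFIG = {
    "ensemble_size": 6,
    "hidden_sizes": (16, 16),   # "2 layers with 16 hidden parameters each"
    "activation": nn.Tanh,      # "tanh is used for non-linearity"
    "minibatch_size": 32,
    "gradient_steps_per_update": 64,
    "update_every_timesteps": 16,
}

def make_value_net(obs_dim, hidden_sizes, activation):
    layers, in_dim = [], obs_dim
    for h in hidden_sizes:
        layers += [nn.Linear(in_dim, h), activation()]
        in_dim = h
    layers.append(nn.Linear(in_dim, 1))   # scalar state value
    return nn.Sequential(*layers)

class ValueEnsemble(nn.Module):
    """Ensemble of independently initialized value networks; members here
    differ only by initialization (the paper's per-member target
    construction is not reproduced)."""
    def __init__(self, obs_dim, cfg=VALUE_CONFIG):
        super().__init__()
        self.members = nn.ModuleList(
            [make_value_net(obs_dim, cfg["hidden_sizes"], cfg["activation"])
             for _ in range(cfg["ensemble_size"])]
        )

    def forward(self, obs):               # obs: (batch, obs_dim)
        return torch.stack([m(obs).squeeze(-1) for m in self.members], dim=-1)

def update_ensemble(ensemble, states, targets, cfg=VALUE_CONFIG):
    """One offline update: 64 Adam steps (default parameters) on minibatches
    of 32, as quoted.  A persistent optimizer across updates may be closer
    to the paper's setup; target construction is left to the caller."""
    opt = torch.optim.Adam(ensemble.parameters())
    for _ in range(cfg["gradient_steps_per_update"]):
        idx = torch.randint(len(states), (cfg["minibatch_size"],))
        pred = ensemble(states[idx])       # (batch, ensemble_size)
        loss = ((pred - targets[idx, None]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return float(loss)
```

In use, an outer control loop would call update_ensemble once every VALUE_CONFIG["update_every_timesteps"] environment steps, matching the quoted "every 16 timesteps the agent experiences."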