Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
Authors: Mohak Bhardwaj, Sanjiban Choudhury, Byron Boots
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a framework for improving on MPC with model-free reinforcement learning (RL). We further propose an algorithm that changes λ over time to reduce the dependence on MPC as our estimates of the value function improve, and test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation. We demonstrate that our approach can obtain performance comparable with MPC with access to true dynamics even under severe model bias and is more sample efficient as compared to model-free RL. |
| Researcher Affiliation | Collaboration | Mohak Bhardwaj¹, Sanjiban Choudhury², Byron Boots¹; ¹University of Washington, ²Aurora Innovation Inc. |
| Pseudocode | Yes | Algorithm 1: MPQ(λ) |
| Open Source Code | No | The paper states: "We use the publicly available implementation at https://bit.ly/38RcDrj for PPO." and "This environment was used without modification from the accompanying codebase for Rajeswaran* et al. (2018) and is available at https://bit.ly/3f6MNBP". These links are for a baseline (PPO) and an environment, respectively, not for the authors' own implementation of MPQ(λ). |
| Open Datasets | No | The paper describes various simulated robot control tasks (CARTPOLESWINGUP, SAWYERPEGINSERTION, INHANDMANIPULATION); these are simulation environments in which data is generated through interaction, not pre-existing, publicly available datasets. No access information for a specific public dataset is provided. |
| Dataset Splits | No | The paper describes validation as: "Validation is performed after every N training episodes during training for N_eval episodes using a fixed set of start states that the environment is reset to. We ensure that the same start states are sampled at every validation iteration by setting the seed value to a pre-defined validation seed, which is kept constant across different runs of the algorithm with different training seeds. For all our experiments we set N = 40 and N_eval = 30." This describes a validation *process* in a simulated environment, not a static dataset split for training/validation/test (a minimal sketch of this protocol appears after the table). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud instances) used to run the simulations or train the models. |
| Software Dependencies | No | The paper mentions using "ADAM (Kingma & Ba, 2014)" for optimization and the "MuJoCo physics engine" for simulation, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all tasks, we represent the Q function using a 2-layer fully-connected neural network with 100 units in each layer and ReLU activation. We use ADAM (Kingma & Ba, 2014) for optimization with a learning rate of 0.001 and discount factor γ = 0.99. Further, the buffer size is 1500 for CARTPOLESWINGUP and 3000 for the others, with a batch size of 64 for all. We smoothly decay λ according to the following sublinear decay rate... Table 1 shows the MPPI parameters used for different experiments (a configuration sketch appears after the table). |
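
The Dataset Splits row quotes a seeded validation protocol rather than a static split. Below is a minimal sketch of what that protocol could look like, assuming the older gym-style `env.seed` / `env.reset` / `env.step` interface; all function and variable names (`validate`, `VALIDATION_SEED`, `policy`) are illustrative placeholders, not taken from the paper's codebase.

```python
import numpy as np

# Quantities from the quoted protocol; everything else is an assumption.
N_TRAIN_EPISODES_PER_VAL = 40    # "N = 40": validate after every N training episodes
N_EVAL_EPISODES = 30             # "N_eval = 30": episodes per validation pass
VALIDATION_SEED = 0              # pre-defined validation seed, constant across training seeds


def validate(env, policy, num_episodes=N_EVAL_EPISODES, seed=VALIDATION_SEED):
    """Evaluate the current policy on a fixed, seeded set of start states."""
    env.seed(seed)  # same seed every validation pass -> same sequence of start states
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))


# During training (sketch): run validation after every N_TRAIN_EPISODES_PER_VAL episodes.
# if (episode + 1) % N_TRAIN_EPISODES_PER_VAL == 0:
#     mean_return = validate(env, policy)
```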
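
The Experiment Setup row lists the Q-function architecture and training hyperparameters in compact prose. The sketch below shows one way that configuration might be written in code, assuming a PyTorch implementation and reading "2 layered ... with 100 units in each layer" as two hidden layers; the class and variable names, the (observation, action) input convention, and the input dimensions are assumptions, not the authors' code. The exact sublinear λ-decay schedule is elided in the quote and is therefore not reproduced here.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the table row above.
GAMMA = 0.99                      # discount factor
LEARNING_RATE = 1e-3              # ADAM learning rate
BATCH_SIZE = 64                   # batch size used for all tasks
BUFFER_SIZE = {                   # replay buffer size per task
    "CartpoleSwingup": 1500,
    "SawyerPegInsertion": 3000,
    "InHandManipulation": 3000,
}


class QNetwork(nn.Module):
    """Fully-connected Q-function: two hidden layers of 100 units with ReLU."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1),    # scalar Q-value
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


# Placeholder dimensions; the real values depend on the task.
q_fn = QNetwork(obs_dim=17, act_dim=6)
optimizer = torch.optim.Adam(q_fn.parameters(), lr=LEARNING_RATE)
```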