Mismatched No More: Joint Model-Policy Optimization for Model-Based RL
Authors: Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, Russ R. Salakhutdinov
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical simulations demonstrate that optimizing this bound yields reward maximizing policies and yields dynamics that (perhaps surprisingly) can aid in exploration. We also show that a deep RL algorithm loosely based on our lower bound can achieve performance competitive with prior model-based methods, and better performance on certain hard exploration tasks. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2Google Brain, 3UC Berkeley |
| Pseudocode | Yes | Algorithm 1 Mismatched no More (MnM) is an algorithm for model-based RL. The method alternates between training the policy on experience from the learned dynamics model with augmented rewards and updating the model+classifier using a GAN-like loss. While we use an off-policy RL algorithm on L4, any other RL algorithm can be substituted. (A hedged sketch of this alternating loop appears after the table.) |
| Open Source Code | No | Code will be released upon publication. |
| Open Datasets | Yes | We use three locomotion tasks from the OpenAI Gym benchmark [7] to compare MnM-approx to MBPO and VMBPO. ... We next use the ROBEL manipulation benchmark [1] to compare how MnM-approx and MBPO handle tasks with more complicated dynamics. |
| Dataset Splits | Yes | We use the standard train/validation/test splits where applicable (e.g. for Metaworld, we use the splits provided by the benchmark). |
| Hardware Specification | No | Proprietary. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | For the Pusher-v2, Door-open-v2, and Dclaw tasks, we use a planning horizon of 1. For the Half Cheetah and Ant tasks, we use a horizon of 5. For all other tasks, we use a horizon of 2. We regularize the policy using an entropy coefficient of 0.1. We train the dynamics model for 200 epochs. We collect 5000 environment transitions for each training iteration. We use a batch size of 256 for all neural network updates. All neural networks are MLPs with 2 hidden layers and 256 units per layer. We use the Adam optimizer with a learning rate of 0.001. (These settings are collected into a config sketch after the table.) |
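
The Pseudocode row above describes MnM's alternating structure: train the policy on model rollouts with augmented rewards, then update the model and classifier with a GAN-like loss. Below is a minimal PyTorch sketch of that structure, not the authors' implementation: the environment dimensions, the exact model/classifier objective (here an illustrative MSE plus adversarial term), and the use of the classifier's log-odds as the reward correction are assumptions made for illustration; consult the paper for the precise objectives.

```python
# Hedged sketch of an MnM-style alternating update (Algorithm 1 in the paper).
# Dimensions, objectives, and the reward-correction form are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2

def mlp(inp, out):
    # 2 hidden layers of 256 units each, matching the architecture reported in the paper.
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out))

dynamics = mlp(obs_dim + act_dim, obs_dim)    # learned model: (s, a) -> predicted s'
classifier = mlp(2 * obs_dim + act_dim, 1)    # distinguishes real from model transitions
opt_model = torch.optim.Adam(dynamics.parameters(), lr=1e-3)
opt_clf = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def augmented_reward(r, s, a, s_next):
    # If D = sigmoid(logit), then log(D / (1 - D)) equals the raw logit, so the
    # classifier's logit is used here as the reward correction (an assumption).
    logit = classifier(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)
    return r + logit.detach()

def train_iteration(real_batch):
    s, a, r, s_next = real_batch

    # (1) GAN-like model + classifier update: the classifier separates real
    #     transitions from model-generated ones; the model fits the data and
    #     tries to fool the classifier (illustrative MLE + adversarial term).
    s_model = dynamics(torch.cat([s, a], dim=-1))
    real_logits = classifier(torch.cat([s, a, s_next], dim=-1))
    fake_logits = classifier(torch.cat([s, a, s_model.detach()], dim=-1))
    clf_loss = (bce(real_logits, torch.ones_like(real_logits))
                + bce(fake_logits, torch.zeros_like(fake_logits)))
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    fool_logits = classifier(torch.cat([s, a, s_model], dim=-1))
    model_loss = (((s_model - s_next) ** 2).mean()
                  + bce(fool_logits, torch.ones_like(fool_logits)))
    opt_model.zero_grad(); model_loss.backward(); opt_model.step()

    # (2) The policy would now be trained on rollouts from `dynamics` using the
    #     augmented reward; any RL algorithm can be substituted here (the paper
    #     uses an off-policy one). Omitted to keep the sketch short.
    r_tilde = augmented_reward(r, s, a, s_model.detach())
    return clf_loss.item(), model_loss.item(), r_tilde.mean().item()

# Toy usage: random tensors stand in for a replay buffer of real transitions.
batch = (torch.randn(256, obs_dim), torch.randn(256, act_dim),
         torch.randn(256), torch.randn(256, obs_dim))
print(train_iteration(batch))
```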
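
The Experiment Setup row lists concrete hyperparameters scattered through prose. The snippet below just collects those reported values into one place; the config keys and the lookup helper are hypothetical names, not taken from the authors' code.

```python
# Reported MnM-approx hyperparameters gathered into a single config.
# Key names are illustrative; only the values come from the paper's setup description.
MNM_APPROX_CONFIG = {
    "planning_horizon": {              # task-dependent model rollout horizon
        "Pusher-v2": 1, "Door-open-v2": 1, "DClaw": 1,
        "HalfCheetah": 5, "Ant": 5,
        "default": 2,                  # all other tasks
    },
    "entropy_coef": 0.1,               # policy entropy regularization
    "model_epochs": 200,               # dynamics-model training epochs
    "env_steps_per_iteration": 5000,   # environment transitions per training iteration
    "batch_size": 256,                 # for all neural-network updates
    "hidden_layers": [256, 256],       # all networks: 2-layer MLPs, 256 units per layer
    "optimizer": "Adam",
    "learning_rate": 1e-3,
}

def planning_horizon(task: str) -> int:
    """Look up the rollout horizon for a task, falling back to the default of 2."""
    horizons = MNM_APPROX_CONFIG["planning_horizon"]
    return horizons.get(task, horizons["default"])

print(planning_horizon("HalfCheetah"))  # 5
print(planning_horizon("Hopper-v2"))    # 2 (default)
```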