Mismatched No More: Joint Model-Policy Optimization for Model-Based RL
Authors: Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, Russ R. Salakhutdinov
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical simulations demonstrate that optimizing this bound yields reward maximizing policies and yields dynamics that (perhaps surprisingly) can aid in exploration. We also show that a deep RL algorithm loosely based on our lower bound can achieve performance competitive with prior model-based methods, and better performance on certain hard exploration tasks. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2Google Brain, 3UC Berkeley |
| Pseudocode | Yes | Algorithm 1 Mismatched no More (MnM) is an algorithm for model-based RL. The method alternates between training the policy on experience from the learned dynamics model with augmented rewards and updating the model+classifier using a GAN-like loss. While we use an off-policy RL algorithm on L4, any other RL algorithm can be substituted. (A hedged sketch of this alternating loop appears after the table.) |
| Open Source Code | No | Code will be released upon publication. |
| Open Datasets | Yes | We use three locomotion tasks from the OpenAI Gym benchmark [7] to compare MnM-approx to MBPO and VMBPO. ... We next use the ROBEL manipulation benchmark [1] to compare how MnM-approx and MBPO handle tasks with more complicated dynamics. |
| Dataset Splits | Yes | We use the standard train/validation/test splits where applicable (e.g. for Metaworld, we use the splits provided by the benchmark). |
| Hardware Specification | No | Proprietary. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | For the Pusher-v2, Door-open-v2, and Dclaw tasks, we use a planning horizon of 1. For the Half Cheetah and Ant tasks, we use a horizon of 5. For all other tasks, we use a horizon of 2. We regularize the policy using an entropy coefficient of 0.1. We train the dynamics model for 200 epochs. We collect 5000 environment transitions for each training iteration. We use a batch size of 256 for all neural network updates. All neural networks are MLPs with 2 hidden layers and 256 units per layer. We use the Adam optimizer with a learning rate of 0.001. (These settings are collected into a config sketch after the table.) |
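
The Pseudocode row above describes MnM's alternating structure: train the policy on model rollouts with augmented rewards, then update the model and classifier with a GAN-like loss. Below is a minimal PyTorch sketch of that structure, not the authors' implementation: the environment dimensions, the exact model/classifier objective (here an illustrative MSE plus adversarial term), and the use of the classifier's log-odds as the reward correction are assumptions made for illustration; consult the paper for the precise objectives.

```python
# Hedged sketch of an MnM-style alternating update (Algorithm 1 in the paper).
# Dimensions, objectives, and the reward-correction form are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2

def mlp(inp, out):
    # 2 hidden layers of 256 units each, matching the architecture reported in the paper.
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out))

dynamics = mlp(obs_dim + act_dim, obs_dim)    # learned model: (s, a) -> predicted s'
classifier = mlp(2 * obs_dim + act_dim, 1)    # distinguishes real from model transitions
opt_model = torch.optim.Adam(dynamics.parameters(), lr=1e-3)
opt_clf = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def augmented_reward(r, s, a, s_next):
    # If D = sigmoid(logit), then log(D / (1 - D)) equals the raw logit, so the
    # classifier's logit is used here as the reward correction (an assumption).
    logit = classifier(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)
    return r + logit.detach()

def train_iteration(real_batch):
    s, a, r, s_next = real_batch

    # (1) GAN-like model + classifier update: the classifier separates real
    #     transitions from model-generated ones; the model fits the data and
    #     tries to fool the classifier (illustrative MLE + adversarial term).
    s_model = dynamics(torch.cat([s, a], dim=-1))
    real_logits = classifier(torch.cat([s, a, s_next], dim=-1))
    fake_logits = classifier(torch.cat([s, a, s_model.detach()], dim=-1))
    clf_loss = (bce(real_logits, torch.ones_like(real_logits))
                + bce(fake_logits, torch.zeros_like(fake_logits)))
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    fool_logits = classifier(torch.cat([s, a, s_model], dim=-1))
    model_loss = (((s_model - s_next) ** 2).mean()
                  + bce(fool_logits, torch.ones_like(fool_logits)))
    opt_model.zero_grad(); model_loss.backward(); opt_model.step()

    # (2) The policy would now be trained on rollouts from `dynamics` using the
    #     augmented reward; any RL algorithm can be substituted here (the paper
    #     uses an off-policy one). Omitted to keep the sketch short.
    r_tilde = augmented_reward(r, s, a, s_model.detach())
    return clf_loss.item(), model_loss.item(), r_tilde.mean().item()

# Toy usage: random tensors stand in for a replay buffer of real transitions.
batch = (torch.randn(256, obs_dim), torch.randn(256, act_dim),
         torch.randn(256), torch.randn(256, obs_dim))
print(train_iteration(batch))
```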
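
The Experiment Setup row lists concrete hyperparameters scattered through prose. The snippet below just collects those reported values into one place; the config keys and the lookup helper are hypothetical names, not taken from the authors' code.

```python
# Reported MnM-approx hyperparameters gathered into a single config.
# Key names are illustrative; only the values come from the paper's setup description.
MNM_APPROX_CONFIG = {
    "planning_horizon": {              # task-dependent model rollout horizon
        "Pusher-v2": 1, "Door-open-v2": 1, "DClaw": 1,
        "HalfCheetah": 5, "Ant": 5,
        "default": 2,                  # all other tasks
    },
    "entropy_coef": 0.1,               # policy entropy regularization
    "model_epochs": 200,               # dynamics-model training epochs
    "env_steps_per_iteration": 5000,   # environment transitions per training iteration
    "batch_size": 256,                 # for all neural-network updates
    "hidden_layers": [256, 256],       # all networks: 2-layer MLPs, 256 units per layer
    "optimizer": "Adam",
    "learning_rate": 1e-3,
}

def planning_horizon(task: str) -> int:
    """Look up the rollout horizon for a task, falling back to the default of 2."""
    horizons = MNM_APPROX_CONFIG["planning_horizon"]
    return horizons.get(task, horizons["default"])

print(planning_horizon("HalfCheetah"))  # 5
print(planning_horizon("Hopper-v2"))    # 2 (default)
```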