MOPO: Model-based Offline Policy Optimization

Authors: Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y. Zou, Sergey Levine, Chelsea Finn, Tengyu Ma

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we aim to study the following questions: (1) How does MOPO perform on standard offline RL benchmarks in comparison to prior state-of-the-art approaches? (2) Can MOPO solve tasks that require generalization to out-of-distribution behaviors? (3) How does each component in MOPO affect performance?
Researcher Affiliation | Academia | Stanford University, UC Berkeley. {tianheyu,gwthomas}@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: Framework for Model-based Offline Policy Optimization (MOPO) with Reward Penalty. (A hedged sketch of the reward-penalty step appears after this table.)
Open Source Code | Yes | The code is available online, released at https://github.com/tianheyu927/mopo.
Open Datasets | Yes | To answer question (1), we evaluate our method on a large subset of datasets in the D4RL benchmark [18] based on the MuJoCo simulator [69], including three environments (halfcheetah, hopper, and walker2d) and four dataset types (random, medium, mixed, medium-expert), yielding a total of 12 problem settings. (A hedged loading sketch for these datasets also follows the table.)
Dataset Splits | No | The paper describes the types of datasets used from the D4RL benchmark (random, medium, mixed, medium-expert) but does not provide specific train/validation/test split percentages or counts for reproduction.
Hardware Specification | No | The paper does not specify the hardware used; it only refers readers to the appendix: "For more details on the experimental set-up and hyperparameters, see Appendix G."
Software Dependencies | No | The paper mentions the use of the 'MuJoCo simulator' but does not provide specific software names with version numbers.
Experiment Setup | Yes | For more details on the experimental set-up and hyperparameters, see Appendix G.
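
The reward-penalty step named in the Pseudocode row is the core of Algorithm 1: MOPO learns an ensemble of probabilistic dynamics models and penalizes the model-predicted reward by an uncertainty estimate, r̃(s, a) = r̂(s, a) - λ u(s, a). Below is a minimal sketch of that step only, assuming a hypothetical `ensemble_predict` helper and illustrative shapes; it is not the authors' released implementation.

```python
# Minimal sketch of MOPO's uncertainty-penalized reward (the key step of Algorithm 1).
# `ensemble_predict`, `lam`, and the shapes below are illustrative assumptions,
# not the authors' released code.
import numpy as np

def penalized_reward(state, action, ensemble_predict, lam=1.0):
    """Compute r_hat(s, a) - lam * u(s, a).

    ensemble_predict is assumed to return, for each of E ensemble members,
    a predicted reward (shape (E,)) and the standard deviation of its
    Gaussian next-state prediction (shape (E, state_dim)).
    """
    rewards, stds = ensemble_predict(state, action)
    # Heuristic uncertainty estimate used in the paper: the maximum, over
    # ensemble members, of the norm of the predicted standard deviation.
    u = np.max(np.linalg.norm(stds, axis=-1))
    return float(np.mean(rewards) - lam * u)
```

In the full method, short model rollouts generated under this penalized reward are mixed with the offline data and optimized with an off-policy actor-critic algorithm (SAC in the paper).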
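
For the D4RL datasets listed in the Open Datasets row, a common loading path is the `d4rl` Python package, which registers the offline environments with `gym`. The environment id below is an assumed example following D4RL's naming convention; the exact version suffix depends on the installed release.

```python
# Sketch of loading one of the D4RL datasets named above (halfcheetah, medium).
# Assumes the `d4rl` package and a MuJoCo-enabled gym install; the exact
# environment id / version suffix depends on the D4RL release.
import gym
import d4rl  # noqa: F401  (importing registers the offline envs with gym)

env = gym.make("halfcheetah-medium-v0")
data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, terminals, next_observations
print(data["observations"].shape, data["actions"].shape)
```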