MOReL: Model-Based Offline Reinforcement Learning

Authors: Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.
Researcher Affiliation | Collaboration | Rahul Kidambi, Cornell University, Ithaca (rkidambi@cornell.edu); Aravind Rajeswaran, University of Washington, Seattle and Google Research, Brain Team (aravraj@cs.washington.edu); Praneeth Netrapalli, Microsoft Research, India (praneeth@microsoft.com); Thorsten Joachims, Cornell University, Ithaca (tj@cornell.edu)
Pseudocode | Yes | Algorithm 1: 'MOReL: Model Based Offline Reinforcement Learning' (an illustrative sketch of the pipeline appears below the table).
Open Source Code | No | Project webpage: https://sites.google.com/view/morel (the webpage states 'Code coming soon!', indicating it was not available at the time of publication).
Open Datasets | Yes | The tasks considered include Hopper-v2, HalfCheetah-v2, Ant-v2, and Walker2d-v2, illustrated in Figure 2. Five different logged datasets are used for each environment, totalling 20 environment-dataset combinations (enumerated in a sketch below the table). Datasets are collected based on the work of Wu et al. [18], with each dataset containing the equivalent of 1 million timesteps of environment interaction.
Dataset Splits | No | The paper mentions using a 'static dataset of interactions' but does not specify training, validation, or test splits in the traditional sense, as policies are evaluated via rollouts in the environment.
Hardware Specification | No | The paper acknowledges 'computing resources from the Cornell Graphite cluster' but does not give specific CPU, GPU, or memory details for the experiments.
Software Dependencies | No | The paper mentions software such as OpenAI Gym [73], MuJoCo [74], and the Adam optimizer [68], but does not provide specific version numbers for these or other libraries/frameworks.
Experiment Setup | No | The paper uses 2-layer ReLU MLPs for the dynamics models (an ensemble of 4) and a 2-layer tanh MLP for the policy, with results averaged over 5 random seeds using the same hyperparameters; however, it does not report specific values such as learning rate, batch size, or optimizer settings (an architecture sketch appears below the table).
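The algorithm is only named in the Pseudocode row above; as a rough, non-authoritative illustration of the pipeline the paper describes (learn an ensemble of dynamics models from the logged data, construct a pessimistic MDP that diverts to an absorbing HALT state wherever the ensemble disagrees, then optimize a policy inside that P-MDP), here is a minimal Python sketch. The callables `fit_model`, `train_policy`, and `reward_fn`, together with the disagreement threshold and halt penalty values, are illustrative assumptions, not the authors' code.

```python
import numpy as np

def morel_sketch(dataset, fit_model, train_policy, reward_fn,
                 n_models=4, disagreement_threshold=1.0, halt_penalty=-100.0):
    """Illustrative sketch of the MOReL pipeline (not the authors' implementation).

    fit_model(dataset, seed)       -> model with .predict(state, action)
    train_policy(step_fn, dataset) -> policy trained by rolling out step_fn
    reward_fn(state, action)       -> scalar reward (known or learned)
    """
    # 1. Learn an ensemble of dynamics models from the static offline dataset.
    models = [fit_model(dataset, seed=i) for i in range(n_models)]

    # 2. Unknown state-action detector: large ensemble disagreement marks
    #    regions that the offline data does not support.
    def is_unknown(state, action):
        preds = np.stack([m.predict(state, action) for m in models])
        disagreement = np.max(np.linalg.norm(preds - preds.mean(axis=0), axis=-1))
        return disagreement > disagreement_threshold

    # 3. Pessimistic MDP: unknown transitions end the episode with a large
    #    negative reward (the pessimism penalty), mimicking an absorbing HALT state.
    def pessimistic_step(state, action):
        if is_unknown(state, action):
            return state, halt_penalty, True  # HALT: episode terminates
        next_state = models[np.random.randint(n_models)].predict(state, action)
        return next_state, reward_fn(state, action), False

    # 4. Run any model-based policy optimizer inside the pessimistic MDP
    #    (the paper uses model-based natural policy gradient).
    return train_policy(pessimistic_step, dataset)
```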
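For concreteness, the four benchmark environments named in the Open Datasets row can be instantiated with OpenAI Gym. The sketch below only enumerates the 4 environments × 5 logged datasets = 20 combinations; it does not reproduce the data-collection policies of Wu et al. [18], whose labels are not given above, and `dataset_id` is a placeholder index.

```python
import gym  # the -v2 tasks require a MuJoCo installation

ENV_NAMES = ["Hopper-v2", "HalfCheetah-v2", "Ant-v2", "Walker2d-v2"]
N_DATASETS_PER_ENV = 5  # five logged datasets per environment (Wu et al. [18])

# 4 environments x 5 datasets = 20 environment-dataset combinations.
combinations = [(name, dataset_id)
                for name in ENV_NAMES
                for dataset_id in range(N_DATASETS_PER_ENV)]
assert len(combinations) == 20

# The environments themselves are openly available via Gym/MuJoCo.
envs = {name: gym.make(name) for name in ENV_NAMES}
```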
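The architectures named in the Experiment Setup row can be written down directly. The PyTorch sketch below assumes a hidden width and learning rate that the paper does not report, and uses Hopper-v2 state/action dimensions purely as an example.

```python
import torch
import torch.nn as nn

def two_layer_mlp(in_dim, out_dim, hidden=512, activation=nn.ReLU):
    """2-hidden-layer MLP; the hidden width of 512 is an assumption, not from the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), activation(),
        nn.Linear(hidden, hidden), activation(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 11, 3  # Hopper-v2 observation/action sizes (example only)

# Ensemble of 4 dynamics models: 2-layer ReLU MLPs mapping (s, a) -> next state.
dynamics_ensemble = [two_layer_mlp(state_dim + action_dim, state_dim)
                     for _ in range(4)]

# Policy: 2-layer tanh MLP mapping state -> action (mean).
policy = two_layer_mlp(state_dim, action_dim, activation=nn.Tanh)

# Adam is named in the paper; the learning rate here is an assumed value.
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3)
              for m in dynamics_ensemble]
```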