Revisiting Design Choices in Offline Model-Based Reinforcement Learning

Authors: Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, Stephen J. Roberts

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations that are vastly different to those currently used in existing hand-tuned state-of-the-art methods, and result in drastically stronger performance.
Researcher Affiliation | Academia | Cong Lu, Philip J. Ball, Jack Parker-Holder, Michael A. Osborne, Stephen J. Roberts; Department of Engineering, University of Oxford
Pseudocode | No | The paper describes algorithms and methods in prose, but it does not include any formal pseudocode blocks or algorithm listings.
Open Source Code | No | The paper mentions that 'The D4RL (Fu et al., 2021a) codebase and datasets used for the empirical evaluation is available under the CC BY 4.0 Licence', but this refers to a third-party dataset and codebase, not the authors' own implementation of their methodology or experiments.
Open Datasets | Yes | Using D4RL (Fu et al., 2021a), we train models on each dataset, then evaluate them on other datasets from the same environment, but collected under different policies.
Dataset Splits | No | The paper uses the D4RL datasets and refers to 'train' and 'test' scenarios, but it does not explicitly provide details about specific training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit mention of validation set usage for hyperparameter tuning) beyond using benchmark datasets.
Hardware Specification | Yes | Each BO iteration is run for 300 epochs on a single seed, and the full optimization over an offline dataset took ~200 hours on an NVIDIA GeForce GTX 1080 Ti GPU, taken up predominantly by MOPO training.
Software Dependencies | No | The paper mentions using Python and specific algorithms/frameworks like SAC and Bayesian Optimization (CASMOPOLITAN), but it does not provide specific version numbers for any software dependencies (e.g., Python version, PyTorch version, or specific library versions).
Experiment Setup | Yes | We define our search space over hyperparameters most related to uncertainty quantification: Penalty type (categorical): taking values over {Max Aleatoric, Max Pairwise Diff, LOO KL, LL Var, Ensemble Std, Ensemble Variance}. Penalty scale λ (continuous): taking values over [1, 100]. h (integer): taking values over {1, 2, ..., 50}. Models N (integer): taking values over {1, 2, ..., 15}.
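As context for the Open Datasets row, the following is a minimal sketch of loading several D4RL datasets for a single environment, assuming the gym and d4rl packages; the halfcheetah dataset names are illustrative and not necessarily the exact ones used in the paper.

```python
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the offline datasets with gym

# Illustrative D4RL datasets for one environment, each collected under a
# different behaviour policy.
dataset_names = [
    "halfcheetah-random-v2",
    "halfcheetah-medium-v2",
    "halfcheetah-medium-replay-v2",
    "halfcheetah-medium-expert-v2",
]

datasets = {}
for name in dataset_names:
    env = gym.make(name)
    # Each dataset is a dict of numpy arrays: observations, actions,
    # next_observations, rewards, terminals.
    datasets[name] = d4rl.qlearning_dataset(env)

# A dynamics model trained on one of these datasets can then be evaluated on
# the others (same environment, different data-collecting policies), matching
# the cross-dataset protocol quoted in the Open Datasets row.
```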
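The Experiment Setup row describes a mixed categorical/continuous/integer search space. The sketch below expresses that space with scikit-optimize's gp_minimize as a generic stand-in for the CASMOPOLITAN optimizer used in the paper; train_and_evaluate_mopo is a hypothetical placeholder (stubbed here with a random score so the script runs) for training MOPO under a given configuration and returning its normalized D4RL score.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Categorical, Integer, Real
from skopt.utils import use_named_args

# Mixed search space matching the Experiment Setup row:
# penalty type (categorical), penalty scale lambda (continuous),
# rollout length h (integer), number of ensemble models N (integer).
space = [
    Categorical(["max_aleatoric", "max_pairwise_diff", "loo_kl",
                 "ll_var", "ensemble_std", "ensemble_variance"],
                name="penalty_type"),
    Real(1.0, 100.0, name="penalty_scale"),
    Integer(1, 50, name="rollout_length"),
    Integer(1, 15, name="num_models"),
]

rng = np.random.default_rng(0)

def train_and_evaluate_mopo(penalty_type, penalty_scale, rollout_length, num_models):
    # Hypothetical placeholder for the expensive step: train MOPO with this
    # configuration for a fixed number of epochs and return the normalized
    # D4RL score. A random value stands in here.
    return rng.uniform(0.0, 100.0)

@use_named_args(space)
def objective(penalty_type, penalty_scale, rollout_length, num_models):
    score = train_and_evaluate_mopo(penalty_type, penalty_scale,
                                    rollout_length, num_models)
    return -score  # gp_minimize minimizes, so negate the score

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("Best configuration:", result.x, "best score:", -result.fun)
```

This is only a sketch of the optimization loop's shape; the paper's actual optimizer (CASMOPOLITAN) is designed specifically for such mixed categorical/continuous spaces, whereas gp_minimize handles categorical dimensions by one-hot encoding them.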