Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

Authors: Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, Scott Sanner

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2%, and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets. We evaluate CBOP on the D4RL benchmark of continuous control tasks (Fu et al., 2020).
Researcher Affiliation | Collaboration | University of Toronto, LG AI Research, Vector Institute
Pseudocode | Yes | Algorithm 1 Conservative Bayesian MVE; please see Algorithm 2 in Appendix B.1 for the full description of CBOP. (A hedged sketch of the conservative value-expansion target is given after the table.)
Open Source Code | Yes | We release our code at https://github.com/jihwan-jeong/CBOP.
Open Datasets | Yes | We evaluate these RQs on the standard D4RL offline RL benchmark (Fu et al., 2020). (A minimal dataset-loading example is given after the table.)
Dataset Splits | No | The paper mentions using the D4RL benchmark but does not provide specific details on training, validation, or testing dataset splits, such as percentages, sample counts, or explicit instructions for replication.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or server configurations used to run the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., libraries, frameworks, or programming languages with their exact versions).
Experiment Setup | Yes | We use Adam optimizer with a learning rate of 1e-4 for all networks except the Q functions (3e-4) and a batch size of 256. For the D4RL experiments, we train for 1M gradient steps. ... For all experiments, we use discount factor γ = 0.99. (These reported values are collected in the snippet after the table.)
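
The Pseudocode row references Algorithm 1 (Conservative Bayesian MVE). Below is a minimal Python sketch of the general idea behind a conservative model-based value-expansion target: form h-step rollout targets from an ensemble of learned dynamics models and combine them pessimistically rather than by a plain average. The function name, array shapes, and the simple mean/std lower-confidence-bound are illustrative assumptions, not the authors' implementation; CBOP itself places a Bayesian posterior over the h-step targets as specified in Algorithms 1-2 of the paper.

```python
import numpy as np

def conservative_mve_target(rewards, terminal_values, gamma=0.99, lcb_coef=1.0):
    """Sketch of a conservative h-step value-expansion target.

    rewards:         array of shape (E, H) -- imagined rewards from an
                     ensemble of E dynamics-model rollouts over H steps
                     (hypothetical interface, not CBOP's exact one).
    terminal_values: array of shape (E, H + 1) -- bootstrapped critic
                     values V(s_h) at every rollout depth h = 0..H.
    Returns a scalar target: a lower confidence bound over the per-depth
    MVE targets, standing in for CBOP's Bayesian posterior aggregation.
    """
    H = rewards.shape[1]
    targets = []
    for h in range(H + 1):
        # h-step MVE target: discounted rewards up to depth h plus the
        # bootstrapped value at depth h, computed per ensemble member.
        discounts = gamma ** np.arange(h)
        ret = (discounts * rewards[:, :h]).sum(axis=1)
        ret += gamma ** h * terminal_values[:, h]
        targets.append(ret)
    targets = np.stack(targets)        # shape (H + 1, E)
    mu, sigma = targets.mean(), targets.std()
    return mu - lcb_coef * sigma       # conservative (LCB) target
```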
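The Open Datasets row points to the standard D4RL benchmark. The snippet below shows the usual way of loading one of its datasets with the public d4rl package; the specific dataset name (hopper-medium-v2) is one of the standard MuJoCo locomotion datasets and is chosen only for illustration, since the exact dataset versions are not listed in this table.

```python
import gym
import d4rl  # importing registers the D4RL offline environments with gym

# One of the standard D4RL MuJoCo locomotion datasets (illustrative choice).
env = gym.make("hopper-medium-v2")

# Dict of numpy arrays: observations, actions, rewards, next_observations, terminals.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape, dataset["actions"].shape)
```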
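The Experiment Setup row quotes the main optimization hyperparameters. The snippet below simply instantiates those reported values with PyTorch's Adam optimizer; the network definitions are placeholders for illustration and do not reflect the architectures used in the paper.

```python
import torch

# Values quoted in the Experiment Setup row.
GAMMA = 0.99
BATCH_SIZE = 256
GRADIENT_STEPS = 1_000_000

# Placeholder networks (not the paper's architectures).
actor = torch.nn.Linear(17, 6)       # stand-in policy network
q_func = torch.nn.Linear(17 + 6, 1)  # stand-in Q network

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)  # all non-Q networks
q_opt = torch.optim.Adam(q_func.parameters(), lr=3e-4)     # Q functions
```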