OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning

Authors: Fan Wu, Rui Zhang, Qi Yi, Yunkai Gao, Jiaming Guo, Shaohui Peng, Siming Lan, Husheng Han, Yansong Pan, Kaizhao Yuan, Pengwei Jin, Ruizhi Chen, Yunji Chen, Ling Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. Our experiments aim to: a) evaluate how well OCEAN improves the performance of state-of-the-art offline model-based methods, b) examine whether conservative exploration is necessary, c) determine the impact of different hyperparameters on the performance of the algorithm. We conduct extensive experiments and validate the effectiveness of our proposed approach, OCEAN, on the D4RL MuJoCo benchmark.
Researcher Affiliation Academia Fan Wu1,2, Rui Zhang3, Qi Yi4, Yunkai Gao4, Jiaming Guo3, Shaohui Peng1, Siming Lan4, Husheng Han2,3, Yansong Pan2, Kaizhao Yuan2, Pengwei Jin2,3, Ruizhi Chen1, Yunji Chen2,3, Ling Li1,2,* 1 Intelligent Software Research Center, Institute of Software, CAS, Beijing, China 2 University of Chinese Academy of Sciences, UCAS, Beijing, China 3 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China 4 University of Science and Technology of China, USTC, Hefei, China wufan2020@iscas.ac.cn
Pseudocode Yes Algorithm 1: OCEAN
Open Source Code No We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets and our code is based on the OfflineRL-Kit library (https://github.com/yihaosun1124/OfflineRL-Kit).
Open Datasets Yes Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets.
Dataset Splits No MOPO-based algorithms run for 2M gradient steps across 8 different random seeds and the final mean performance of 100 episodes is reported. No explicit details on dataset splits (e.g., percentages, counts) for training, validation, or testing are provided.
Hardware Specification No No specific hardware details (e.g., GPU models, CPU types, memory) used for running experiments were mentioned in the paper.
Software Dependencies No We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets and our code is based on the OfflineRL-Kit library. No version number is provided for the OfflineRL-Kit library or any other specific software components.
Experiment Setup Yes MOPO-based algorithms run for 2M gradient steps across 8 different random seeds and the final mean performance of 100 episodes is reported. The base hyperparameters that we use for OCEAN mostly follow the OfflineRL-Kit library. Additionally, we conduct experiments to explore various uncertainty estimators and exploration strategies. Lastly, we perform parameter tuning experiments on the two pivotal hyperparameters in OCEAN, namely the penalty threshold u_T and the noise standard deviation δ. A minimal sketch of the reported evaluation protocol appears after the table.
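
The evaluation protocol quoted above (D4RL MuJoCo datasets, final mean performance over 100 episodes, 8 random seeds) is reported without accompanying code. The following is a minimal sketch of that protocol using the public gym/d4rl APIs, assuming the old-style gym interface that d4rl requires; the environment name `halfcheetah-medium-v2`, the `make_policy` factory, and the `policy` callable are illustrative placeholders, not the authors' released implementation.

```python
# Hedged sketch of the evaluation protocol quoted in the table above:
# load a D4RL MuJoCo dataset, roll out a policy for 100 episodes, and
# average the D4RL-normalized score over 8 random seeds.
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the D4RL environments
import numpy as np


def load_dataset(env_name="halfcheetah-medium-v2"):
    """Return a D4RL env and its offline dataset as (s, a, r, s', done) arrays."""
    env = gym.make(env_name)
    dataset = d4rl.qlearning_dataset(env)
    return env, dataset


def evaluate(env, policy, num_episodes=100):
    """Mean D4RL-normalized score of `policy` over `num_episodes` rollouts."""
    scores = []
    for _ in range(num_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                 # `policy` is a placeholder callable
            obs, reward, done, _ = env.step(action)
            episode_return += reward
        scores.append(env.get_normalized_score(episode_return) * 100.0)
    return float(np.mean(scores))


def mean_over_seeds(make_policy, env_name="halfcheetah-medium-v2",
                    num_seeds=8, num_episodes=100):
    """Average the evaluation score over several seeds (the paper reports 8)."""
    scores = []
    for seed in range(num_seeds):
        env, _ = load_dataset(env_name)
        env.seed(seed)                           # old-gym seeding, as used by d4rl
        np.random.seed(seed)
        policy = make_policy(seed)               # placeholder: one trained policy per seed
        scores.append(evaluate(env, policy, num_episodes))
    return float(np.mean(scores)), float(np.std(scores))
```

Note that this sketch only covers the evaluation loop; the OCEAN training procedure itself (Algorithm 1 in the paper), which is built on OfflineRL-Kit, is not reconstructed here.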