OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning
Authors: Fan Wu, Rui Zhang, Qi Yi, Yunkai Gao, Jiaming Guo, Shaohui Peng, Siming Lan, Husheng Han, Yansong Pan, Kaizhao Yuan, Pengwei Jin, Ruizhi Chen, Yunji Chen, Ling Li
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. Our experiments aim to: a) evaluate how well OCEAN improves the performance of state-of-the-art offline model-based methods, b) examine whether conservative exploration is necessary, c) determine the impact of different hyperparameters on the performance of the algorithm. We conduct extensive experimentation and validate the effectiveness of our proposed approach, OCEAN, on the D4RL MuJoCo benchmark. |
| Researcher Affiliation | Academia | Fan Wu1,2, Rui Zhang3, Qi Yi4, Yunkai Gao4, Jiaming Guo3, Shaohui Peng1, Siming Lan4, Husheng Han2,3, Yansong Pan2, Kaizhao Yuan2, Pengwei Jin2,3, Ruizhi Chen1, Yunji Chen2,3, Ling Li1,2,* 1 Intelligent Software Research Center, Institute of Software, CAS, Beijing, China 2 University of Chinese Academy of Sciences, UCAS, Beijing, China 3 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China 4 University of Science and Technology of China, USTC, Hefei, China wufan2020@iscas.ac.cn |
| Pseudocode | Yes | Algorithm 1: OCEAN |
| Open Source Code | No | We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets and our code is based on the OfflineRL-Kit library1. 1https://github.com/yihaosun1124/OfflineRL-Kit |
| Open Datasets | Yes | Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets |
| Dataset Splits | No | MOPO-based algorithms run for 2M gradient steps across 8 different random seeds and the final mean performance of 100 episodes is reported. No explicit details on dataset splits (e.g., percentages, counts) for training, validation, or testing are provided. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | We evaluate our approach on D4RL (Fu et al. 2020) MuJoCo datasets and our code is based on the OfflineRL-Kit library1. No version number is provided for the OfflineRL-Kit library or any other specific software components. |
| Experiment Setup | Yes | MOPO-based algorithms run for 2M gradient steps across 8 different random seeds and the final mean performance of 100 episodes is reported. The base hyperparameters that we use for OCEAN mostly follow the OfflineRL-Kit library. Additionally, we conduct experiments to explore various uncertainty estimators and exploration strategies. Lastly, we perform parameter tuning experiments on the two pivotal hyperparameters in OCEAN, namely the penalty threshold u_T and noise standard deviation δ. *(Illustrative sketches of the dataset access and evaluation protocol appear below this table.)* |
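
The D4RL MuJoCo datasets cited in the Open Datasets and Open Source Code rows are publicly downloadable through the `d4rl` package. The sketch below is a minimal, hypothetical example of fetching one such dataset with the public `gym`/`d4rl` APIs; the specific task id (`hopper-medium-replay-v2`) is an illustrative choice, not one confirmed from the paper.

```python
# Minimal sketch (not from the paper): load a D4RL MuJoCo dataset with the
# public d4rl package. Field names follow d4rl.qlearning_dataset's output.
import gym
import d4rl  # importing d4rl registers the D4RL environment ids with gym

env = gym.make("hopper-medium-replay-v2")   # illustrative task id
dataset = d4rl.qlearning_dataset(env)       # dict of NumPy arrays

print(dataset["observations"].shape)        # (N, obs_dim)
print(dataset["actions"].shape)             # (N, act_dim)
print(dataset["rewards"].shape)             # (N,)
print(dataset["terminals"].shape)           # (N,)
```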
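The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (final mean performance over 100 episodes, across 8 random seeds) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `policy.act(obs)` interface and `policy_for_seed` helper are hypothetical placeholders, and the normalized score uses d4rl's standard `get_normalized_score` helper.

```python
# Illustrative sketch of the reported evaluation protocol (100 episodes per
# seed, 8 seeds); the policy interface is a hypothetical placeholder.
import gym
import numpy as np
import d4rl  # noqa: F401  (registers D4RL environments)

def evaluate(policy, env_name="halfcheetah-medium-v2", n_episodes=100, seed=0):
    """Return the mean D4RL-normalized score (0-100 scale) over n_episodes."""
    env = gym.make(env_name)
    env.seed(seed)
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            action = policy.act(obs)                 # hypothetical interface
            obs, reward, done, _ = env.step(action)
            ep_ret += reward
        returns.append(ep_ret)
    return 100.0 * float(np.mean([env.get_normalized_score(r) for r in returns]))

# Mean over 8 seeds, as reported in the paper (policy_for_seed is hypothetical):
# final_score = np.mean([evaluate(policy_for_seed(s), seed=s) for s in range(8)])
```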