Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning
Authors: Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, Yang Yu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we have verified that our proposed uncertainty quantification can be significantly closer to the true Bellman error than the compared methods. Consequently, MOBILE outperforms prior offline RL approaches on most tasks of D4RL and NeoRL benchmarks. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China 2Polixir Technologies, Nanjing, Jiangsu, China 3Peng Cheng Laboratory, Shenzhen, 518055, China. |
| Pseudocode | Yes | Algorithm 1 MOBILE |
| Open Source Code | Yes | The code is available at https://github.com/yihaosun1124/mobile. |
| Open Datasets | Yes | standard D4RL offline RL benchmark (Fu et al., 2020), which includes Gym and Adroit domains, as well as the near-real-world NeoRL (Qin et al., 2022) benchmark. |
| Dataset Splits | Yes | We train an ensemble of 7 such dynamics models following (Janner et al., 2019; Yu et al., 2020) and pick the best 5 models based on the validation prediction error on a held-out set that contains 1000 transitions in the offline dataset D. (See the ensemble-selection sketch after this table.) |
| Hardware Specification | Yes | All the experiments are run with a single GeForce GTX 3070 GPU and an AMD Ryzen 5900X CPU at 4.8GHz. |
| Software Dependencies | No | The paper mentions SAC and the Adam optimizer but does not provide specific version numbers for any software dependencies or frameworks. |
| Experiment Setup | Yes | Table 3. Hyperparameters of Policy Optimization in MOBILE: K = 2; policy network FC(256, 256); Q-network FC(256, 256); τ = 5e-3; γ = 0.99; actor learning rate 1e-4; critic learning rate 3e-4; batch size 256; N_iter = 3M. (See the config sketch after this table.) |
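
The Dataset Splits row describes a train-7 / keep-5 ensemble selection rule based on held-out prediction error. Below is a minimal Python sketch of that selection step; the `prediction_error` method and the function name are hypothetical placeholders, not identifiers from the released code.

```python
import numpy as np

def select_ensemble_members(models, holdout_batch, n_keep=5):
    """Keep the n_keep dynamics models with the lowest validation error.

    `holdout_batch` stands for the held-out transitions (1000 in the paper);
    `prediction_error` is an assumed per-model method returning a scalar loss.
    """
    obs, act, next_obs, rew = holdout_batch
    errors = [float(m.prediction_error(obs, act, next_obs, rew)) for m in models]
    best = np.argsort(errors)[:n_keep]  # indices of the n_keep lowest-error models
    return [models[i] for i in best]
```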
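The Experiment Setup row quotes the paper's Table 3 policy-optimization hyperparameters. The dictionary below restates them as a config sketch for readability; the key names are illustrative and do not come from the authors' repository.

```python
# Illustrative config mirroring Table 3 of the paper (key names are assumptions).
mobile_policy_config = {
    "num_q_networks": 2,           # K: size of the Q-ensemble
    "policy_hidden": (256, 256),   # actor MLP layer sizes
    "q_hidden": (256, 256),        # critic MLP layer sizes
    "tau": 5e-3,                   # soft target-update coefficient
    "gamma": 0.99,                 # discount factor
    "actor_lr": 1e-4,              # learning rate of the actor
    "critic_lr": 3e-4,             # learning rate of the critic
    "batch_size": 256,
    "num_iterations": 3_000_000,   # N_iter
}
```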