Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning
Authors: Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhi-Hong Deng, Animesh Garg, Peng Liu, Zhaoran Wang
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms. |
| Researcher Affiliation | Collaboration | Chenjia Bai (Harbin Institute of Technology); Lingxiao Wang (Northwestern University); Zhuoran Yang (Princeton University); Zhihong Deng (University of Technology Sydney); Animesh Garg (University of Toronto, Vector Institute, NVIDIA); Peng Liu (Harbin Institute of Technology); Zhaoran Wang (Northwestern University) |
| Pseudocode | Yes | Algorithm 1: PBRL algorithm. (A hedged sketch of the ensemble update appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/Baichenjia/PBRL. |
| Open Datasets | Yes | Our experiments on the D4RL benchmark (Fu et al., 2020) show that PBRL provides reasonable uncertainty quantification and yields better performance compared to the state-of-the-art algorithms. The dataset is released at http://rail.eecs.berkeley.edu/datasets/offline_rl/gym_mujoco_v2_old/. (A loading sketch appears after this table.) |
| Dataset Splits | No | The paper trains on the D4RL benchmark datasets as offline data for policy learning. It does not describe explicit train/validation/test splits of these datasets in the traditional supervised-learning sense (e.g., specific percentages or sample counts for validation). |
| Hardware Specification | Yes | We run experiments on a single A100 GPU. |
| Software Dependencies | No | The paper mentions several software implementations and libraries used (e.g., “SAC implementations”, “CQL”, “BEAR”, “UWAC”, “MOPO”, “TD3-BC”) but does not specify their version numbers. |
| Experiment Setup | Yes | Table 2: Hyper-parameters of PBRL: K = 10; Q-network = FC(256, 256, 256); β_in = 0.01; β_ood = 5.0 → 0.2 (decaying strategy); τ = 0.005; γ = 0.99; actor learning rate = 1e-4; critic learning rate = 3e-4; optimizer = Adam; H = 1M; N_ood = 10. (These values appear as defaults in the sketch after this table.) |
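
As a complement to the pseudocode and experiment-setup rows above, the following is a minimal sketch of a PBRL-style pessimistic ensemble update in PyTorch, using the Table 2 values (K = 10, FC(256,256,256) critics, β_in = 0.01, β_ood = 5.0, γ = 0.99, N_ood = 10) as defaults. The names `QEnsemble` and `pessimistic_targets` are illustrative placeholders rather than the authors' code (the official implementation is at https://github.com/Baichenjia/PBRL), and for brevity the sketch shares one pessimistic target across ensemble members instead of the paper's bootstrapped per-member targets.

```python
# A minimal sketch of a PBRL-style pessimistic ensemble update in PyTorch.
# Names (QEnsemble, pessimistic_targets) are illustrative placeholders,
# not the authors' code; default values follow Table 2 of the paper.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=(256, 256, 256)):
    """FC(256,256,256) critic body, as listed in Table 2."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class QEnsemble(nn.Module):
    """K bootstrapped critics; ensemble disagreement serves as uncertainty."""

    def __init__(self, obs_dim, act_dim, k=10):  # K = 10 (Table 2)
        super().__init__()
        self.members = nn.ModuleList([mlp(obs_dim + act_dim, 1) for _ in range(k)])

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return torch.stack([m(x) for m in self.members], dim=0)  # (K, B, 1)

    def uncertainty(self, obs, act):
        return self(obs, act).std(dim=0)  # std across the K critics


def pessimistic_targets(q_ens, q_target_ens, batch, policy,
                        beta_in=0.01, beta_ood=5.0, gamma=0.99, n_ood=10):
    """In-distribution target penalised by beta_in * uncertainty; OOD
    pseudo-target penalised by beta_ood * uncertainty (beta_ood decays
    from 5.0 towards 0.2 over training in the paper)."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        # In-distribution: standard TD target minus a small uncertainty penalty.
        next_act = policy(next_obs)
        q_next = q_target_ens(next_obs, next_act).mean(dim=0)
        u_next = q_target_ens.uncertainty(next_obs, next_act)
        target_in = rew + gamma * (1.0 - done) * (q_next - beta_in * u_next)

        # OOD regularisation: actions from the current policy at dataset states.
        obs_ood = obs.repeat_interleave(n_ood, dim=0)
        act_ood = policy(obs_ood)
        q_ood = q_ens(obs_ood, act_ood).mean(dim=0)
        u_ood = q_ens.uncertainty(obs_ood, act_ood)
        # Clipping at zero is a stabilising choice; details follow the paper.
        target_ood = torch.clamp(q_ood - beta_ood * u_ood, min=0.0)
    return target_in, (obs_ood, act_ood, target_ood)
```

The ensemble standard deviation acts as the epistemic-uncertainty proxy: in-distribution transitions receive the small β_in penalty, while actions sampled from the current policy (out-of-distribution with respect to the dataset) receive the much larger, decaying β_ood penalty.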
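
For the open-datasets row, the Gym-MuJoCo offline data referenced above is most commonly obtained through the `d4rl` package rather than by downloading the raw files. The snippet below is a sketch assuming `d4rl` and MuJoCo are installed; the task name is an example, not a prescription from the paper.

```python
# Sketch: loading a D4RL Gym-MuJoCo offline dataset (assumes d4rl and MuJoCo are installed).
import gym
import d4rl  # importing d4rl registers the offline environments with gym

# Example task; the paper evaluates the v2 Gym-MuJoCo datasets.
env = gym.make("halfcheetah-medium-v2")

# qlearning_dataset returns numpy arrays: observations, actions, rewards,
# next_observations, terminals -- the transition tuples used for offline training.
dataset = d4rl.qlearning_dataset(env)
print({key: value.shape for key, value in dataset.items()})
```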