Bayesian Design Principles for Offline-to-Online Reinforcement Learning
Authors: Hao Hu, Yiqin Yang, Jianing Ye, Chengjie Wu, Ziqing Mai, Yujing Hu, Tangjie Lv, Changjie Fan, Qianchuan Zhao, Chongjie Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data. |
| Researcher Affiliation | Collaboration | ¹Institute for Interdisciplinary Sciences, Tsinghua University, China; ²Institute of Automation, Chinese Academy of Sciences, China; ³Fuxi AI Lab, Netease, Inc., Hangzhou, China; ⁴Department of Automation, Tsinghua University, China; ⁵Washington University in St. Louis, USA. |
| Pseudocode | Yes | Algorithm 1 BOORL, Offline Phase. Algorithm 2 BOORL, Online Phase. |
| Open Source Code | Yes | Our code is public online at https://github.com/YiqinYang/BOORL. |
| Open Datasets | Yes | To answer the questions above, we conduct experiments to test our proposed approach on the D4RL benchmark (Fu et al., 2020), which encompasses a variety of dataset qualities and domains. |
| Dataset Splits | Yes | We first gradually increase the proportion of the offline data in the training dataset from 15% to 50% to validate the effect of the sudden change in the replay data distribution. |
| Hardware Specification | Yes | All experiments are conducted on the same experimental setup, a single GeForce RTX 3090 GPU and an Intel Core i7-6700k CPU at 4.00GHz. |
| Software Dependencies | No | The paper mentions software used (e.g., TD3+BC, TD3) but does not provide specific version numbers for these or other relevant libraries/dependencies. |
| Experiment Setup | Yes | We outline the hyper-parameters used by BOORL in Table 7. Table 7 includes specific values such as 'Critic learning rate 3e-4', 'Mini-batch size 256', 'Discount factor 0.99', etc. |
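
The Dataset Splits and Experiment Setup rows can be made concrete with a short sketch. It collects the three hyper-parameter values quoted above and draws one mini-batch in which a fixed share of transitions comes from the offline data. This is an illustration, not BOORL's actual sampling code: the helper `sample_mixed_batch`, the dictionary keys, and the choice to apply the offline proportion per mini-batch (rather than to the replay buffer as a whole) are all assumptions made for the example.

```python
import random

# Values quoted from the paper's Table 7, as reported in the table above.
# The key names are hypothetical.
HYPERPARAMS = {
    "critic_lr": 3e-4,
    "batch_size": 256,
    "discount": 0.99,
}


def sample_mixed_batch(offline_data, online_data, offline_fraction, batch_size, rng):
    """Draw a batch with a fixed share of offline transitions.

    Hypothetical helper: assumes the offline proportion is applied per batch.
    """
    n_offline = round(offline_fraction * batch_size)
    n_online = batch_size - n_offline
    # Sample with replacement from each source, then shuffle the combined batch.
    batch = rng.choices(offline_data, k=n_offline) + rng.choices(online_data, k=n_online)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
# Placeholder transitions, tagged by source so the mix can be inspected.
offline = [("offline", i) for i in range(1000)]
online = [("online", i) for i in range(1000)]

# The two ends of the 15% to 50% range mentioned in the Dataset Splits row.
batch_low = sample_mixed_batch(offline, online, 0.15, HYPERPARAMS["batch_size"], rng)
batch_high = sample_mixed_batch(offline, online, 0.50, HYPERPARAMS["batch_size"], rng)
```

Counting the `"offline"` tags in each batch shows how the offline share changes between the two settings while the batch size stays at 256.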