Adaptive Policy Learning for Offline-to-Online Reinforcement Learning
Authors: Han Zheng, Xufang Luo, Pengfei Wei, Xuan Song, Dongsheng Li, Jing Jiang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on popular continuous control tasks, and results show that our algorithm can learn the expert policy with high sample efficiency even when the quality of offline dataset is poor, e.g., random dataset. |
| Researcher Affiliation | Collaboration | Han Zheng¹*, Xufang Luo², Pengfei Wei³, Xuan Song⁴, Dongsheng Li², Jing Jiang¹ (¹University of Technology Sydney, ²Microsoft Research Asia, ³National University of Singapore, ⁴Southern University of Science and Technology) |
| Pseudocode | Yes | Algorithm 1: Greedy Conservative Q-ensemble Learning |
| Open Source Code | No | No explicit statement about providing open-source code for the described methodology or a link to a code repository was found. |
| Open Datasets | Yes | All experiments were done on the continuous control task set MuJoCo (Todorov, Erez, and Tassa 2012), and the offline dataset comes from the popular offline RL benchmark D4RL (Fu et al. 2020). |
| Dataset Splits | No | The paper mentions the use of offline and online data, training step counts (T_initial, T_off), and buffer sizes (20K and 3M), but does not explicitly specify train/validation/test dataset splits for reproducibility. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) were explicitly listed for reproducibility. |
| Experiment Setup | Yes | Settings: We set T_on in Algorithm 1 to 1K. To better exploit the offline dataset, we set T_initial and T_off to 100K and 10K, respectively. For OORB, we set p = 0.5 for GCQL and p = 0.1 for GCTD3BC, and T_s to 10K for both of them. The size of the online and offline buffer is set to 20K and 3M, respectively. We set S_T to 100K. The above configurations are kept the same across all tasks, datasets, and methods. |
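
For context on the Open Datasets row above, the D4RL benchmark exposes its offline MuJoCo datasets through a small Python API. The snippet below is a minimal sketch, assuming the standard `gym` and `d4rl` packages; the specific task/version string (`halfcheetah-random-v2`) is our own illustrative choice, not one prescribed by the paper.

```python
import gym
import d4rl  # importing d4rl registers the offline MuJoCo datasets with gym

# Any D4RL task id works here; halfcheetah-random-v2 illustrates the
# "poor-quality (random) dataset" case highlighted in the paper's abstract.
env = gym.make("halfcheetah-random-v2")

# Dict of numpy arrays: observations, actions, rewards,
# next_observations, terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```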
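The hyperparameters quoted in the Experiment Setup row can also be gathered into one place. The following is a hedged sketch only: the class name `APLConfig`, the field names, and the `sample_batch` helper are hypothetical, only the numeric values come from the quoted settings, and the paper's actual OORB sampling rule may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class APLConfig:
    """Illustrative container for the settings quoted above (names are ours)."""
    t_on: int = 1_000                      # T_on in Algorithm 1
    t_initial: int = 100_000               # T_initial: initial offline training steps
    t_off: int = 10_000                    # T_off: offline training steps per iteration
    t_s: int = 10_000                      # T_s used by OORB for both methods
    p_oorb_gcql: float = 0.5               # OORB probability p for GCQL
    p_oorb_gctd3bc: float = 0.1            # OORB probability p for GCTD3BC
    online_buffer_size: int = 20_000       # 20K online buffer
    offline_buffer_size: int = 3_000_000   # 3M offline buffer
    s_t: int = 100_000                     # S_T = 100K, as quoted


def sample_batch(online_buf, offline_buf, p, batch_size=256):
    """Hypothetical OORB-style draw: with probability p take the batch from the
    offline buffer, otherwise from the online buffer (one plausible reading of
    the mixing probability p; the paper's exact scheme may differ)."""
    source = offline_buf if random.random() < p else online_buf
    return random.sample(source, min(batch_size, len(source)))
```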