Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Authors: Han Zheng, Xufang Luo, Pengfei Wei, Xuan Song, Dongsheng Li, Jing Jiang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments on popular continuous control tasks, and results show that our algorithm can learn the expert policy with high sample efficiency even when the quality of offline dataset is poor, e.g., random dataset.
Researcher Affiliation | Collaboration | Han Zheng (1*), Xufang Luo (2), Pengfei Wei (3), Xuan Song (4), Dongsheng Li (2), Jing Jiang (1); 1: University of Technology Sydney, 2: Microsoft Research Asia, 3: National University of Singapore, 4: Southern University of Science and Technology
Pseudocode | Yes | Algorithm 1: Greedy Conservative Q-ensemble Learning
Open Source Code | No | No explicit statement about providing open-source code for the described methodology, or a link to a code repository, was found.
Open Datasets | Yes | All experiments were done on the continuous control task set MuJoCo (Todorov, Erez, and Tassa 2012), and the offline dataset comes from the popular offline RL benchmark D4RL (Fu et al. 2020). (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions the use of offline and online data, training steps (Tinitial, Toff), and buffer sizes (20K and 3M), but does not explicitly specify train/validation/test dataset splits for reproducibility.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) were explicitly listed for reproducibility.
Experiment Setup | Yes | Settings: We set Ton in Algorithm 1 to 1K. To better exploit the offline dataset, we set Tinitial and Toff to 100K and 10K, respectively. For OORB, we set p = 0.5 for GCQL and p = 0.1 for GCTD3BC, and Ts to 10K for both of them. The size of online and offline buffer is set to 20K and 3M, respectively. We input ST as 100K. The above configurations keep the same across all tasks, datasets and methods. (These values are collected into a config sketch below.)
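
The Open Datasets row notes that the offline data comes from the D4RL benchmark on MuJoCo tasks. A minimal sketch of how such a dataset could be loaded, assuming the standard gym and d4rl Python packages; the task name below is illustrative (a "random"-quality dataset) and is not taken from the paper:

```python
# Minimal sketch: loading a D4RL offline dataset for a MuJoCo task.
# Assumes the standard `gym` and `d4rl` packages; "halfcheetah-random-v2"
# is an illustrative dataset name, not a value reported in the paper.
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("halfcheetah-random-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of transitions for offline RL

print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape,
      dataset["terminals"].shape)
```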
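
The Experiment Setup row reports the hyperparameters as prose. A minimal sketch gathering those values into one configuration dictionary, assuming our own key names (the paper reports only the values, not a config format); per the quoted settings, these values are kept the same across all tasks, datasets and methods:

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into a single
# config dict. Key names are our own shorthand; only the numeric values come
# from the quoted settings.
CONFIG = {
    "T_on": 1_000,                     # Ton in Algorithm 1 (1K)
    "T_initial": 100_000,              # Tinitial (100K)
    "T_off": 10_000,                   # Toff (10K)
    "oorb_p": {"GCQL": 0.5,            # OORB probability p per method
               "GCTD3BC": 0.1},
    "T_s": 10_000,                     # Ts (10K for both methods)
    "online_buffer_size": 20_000,      # online replay buffer (20K)
    "offline_buffer_size": 3_000_000,  # offline replay buffer (3M)
    "S_T": 100_000,                    # ST (100K)
}
```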