Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Authors: Han Zheng, Xufang Luo, Pengfei Wei, Xuan Song, Dongsheng Li, Jing Jiang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments on popular continuous control tasks, and results show that our algorithm can learn the expert policy with high sample efficiency even when the quality of offline dataset is poor, e.g., random dataset.
Researcher Affiliation | Collaboration | Han Zheng (1*), Xufang Luo (2), Pengfei Wei (3), Xuan Song (4), Dongsheng Li (2), Jing Jiang (1); 1: University of Technology Sydney, 2: Microsoft Research Asia, 3: National University of Singapore, 4: Southern University of Science and Technology
Pseudocode | Yes | Algorithm 1: Greedy Conservative Q-ensemble Learning
Open Source Code | No | No explicit statement about providing open-source code for the described methodology, or a link to a code repository, was found.
Open Datasets | Yes | All experiments were done on the continuous control task set MuJoCo (Todorov, Erez, and Tassa 2012), and the offline dataset comes from the popular offline RL benchmark D4RL (Fu et al. 2020). (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions the use of offline and online data, training steps (Tinitial, Toff), and buffer sizes (20K and 3M), but does not explicitly specify train/validation/test dataset splits for reproducibility.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) were explicitly listed for reproducibility.
Experiment Setup | Yes | Settings: We set Ton in Algorithm 1 to 1K. To better exploit the offline dataset, we set Tinitial and Toff to 100K and 10K, respectively. For OORB, we set p = 0.5 for GCQL and p = 0.1 for GCTD3BC, and Ts to 10K for both of them. The size of online and offline buffer is set to 20K and 3M, respectively. We input ST as 100K. The above configurations keep the same across all tasks, datasets and methods. (These values are collected into a config sketch below.)
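
The Open Datasets row notes that the offline data comes from the D4RL benchmark on MuJoCo tasks. A minimal sketch of how such a dataset could be loaded, assuming the standard gym and d4rl Python packages; the task name below is illustrative (a "random"-quality dataset) and is not taken from the paper:

```python
# Minimal sketch: loading a D4RL offline dataset for a MuJoCo task.
# Assumes the standard `gym` and `d4rl` packages; "halfcheetah-random-v2"
# is an illustrative dataset name, not a value reported in the paper.
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("halfcheetah-random-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of transitions for offline RL

print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape,
      dataset["terminals"].shape)
```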
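
The Experiment Setup row reports the hyperparameters as prose. A minimal sketch gathering those values into one configuration dictionary, assuming our own key names (the paper reports only the values, not a config format); per the quoted settings, these values are kept the same across all tasks, datasets and methods:

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into a single
# config dict. Key names are our own shorthand; only the numeric values come
# from the quoted settings.
CONFIG = {
    "T_on": 1_000,                     # Ton in Algorithm 1 (1K)
    "T_initial": 100_000,              # Tinitial (100K)
    "T_off": 10_000,                   # Toff (10K)
    "oorb_p": {"GCQL": 0.5,            # OORB probability p per method
               "GCTD3BC": 0.1},
    "T_s": 10_000,                     # Ts (10K for both methods)
    "online_buffer_size": 20_000,      # online replay buffer (20K)
    "offline_buffer_size": 3_000_000,  # offline replay buffer (3M)
    "S_T": 100_000,                    # ST (100K)
}
```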