Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization

Authors: Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li

IJCAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency compared to previous approaches. |
| Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; {kongr, wucy, gaocx}@lamda.nju.edu.cn, {zzzhang, lim}@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 CPR |
| Open Source Code | Yes | Our code is available at https://github.com/LAMDARL/CPR. |
| Open Datasets | Yes | We select the popular MuJoCo locomotion tasks from D4RL [Fu et al., 2020] as our benchmark for performance comparison. |
| Dataset Splits | No | The paper uses various D4RL datasets (e.g., walker2d-random-v2, hopper-medium-expert-v2) for offline pre-training and online fine-tuning, but does not specify explicit train/validation/test splits (e.g., percentages or sample counts) within the paper. The online phase collects new data into a replay buffer. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run the experiments (e.g., GPU models, CPU types, memory); it only makes general mention of deep reinforcement learning and neural networks. |
| Software Dependencies | No | The paper mentions using algorithms such as SAC [Haarnoja et al., 2018], AWAC [Nair et al., 2020], TD3+BC [Fujimoto and Gu, 2021], and D4RL [Fu et al., 2020], but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | All offline algorithms are trained for 1000 epochs of 1000 random mini-batches each. ... We set revitalization interval Tr = 10 and revitalization fitting epochs Nr = 32. ... We run all methods for 300 episodes with 5 random seeds. In each episode, there are 1000 online interaction steps. (A configuration sketch appears below the table.) |
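The sketch below illustrates the reported setup only: it loads one of the D4RL MuJoCo datasets named in the paper and declares the quoted hyperparameters as constants, with a placeholder online loop. It is not the authors' CPR implementation (their code is at https://github.com/LAMDARL/CPR); the constant names and the random-action agent are illustrative assumptions.

```python
import gym
import d4rl  # registers the D4RL MuJoCo locomotion environments with gym

# Offline phase: load one of the D4RL datasets mentioned in the paper.
env = gym.make("hopper-medium-expert-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...

# Hyperparameters quoted from the paper's experiment setup (names are ours).
OFFLINE_EPOCHS = 1000          # offline pre-training epochs
BATCHES_PER_EPOCH = 1000       # random mini-batches per epoch
REVITALIZATION_INTERVAL = 10   # Tr
REVITALIZATION_EPOCHS = 32     # Nr, revitalization fitting epochs
ONLINE_EPISODES = 300          # online fine-tuning episodes
STEPS_PER_EPISODE = 1000       # online interaction steps per episode
NUM_SEEDS = 5

# Online fine-tuning skeleton; CPR-specific policy updates are omitted and a
# random policy stands in for the fine-tuned agent.
for episode in range(ONLINE_EPISODES):
    obs, done, t = env.reset(), False, 0
    while not done and t < STEPS_PER_EPISODE:
        action = env.action_space.sample()  # stand-in for the learned policy
        obs, reward, done, info = env.step(action)
        t += 1
```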