Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization
Authors: Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency compared to previous approaches. |
| Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; {kongr, wucy, gaocx}@lamda.nju.edu.cn, {zzzhang, lim}@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 CPR |
| Open Source Code | Yes | Our code is available at https://github.com/LAMDARL/CPR. |
| Open Datasets | Yes | We select the popular MuJoCo locomotion tasks from D4RL [Fu et al., 2020] as our benchmark for performance comparison. (A hedged dataset-loading sketch follows this table.) |
| Dataset Splits | No | The paper uses various D4RL datasets (e.g., walker2d-random-v2, hopper-medium-expert-v2) for offline pre-training and online fine-tuning, but does not specify explicit train/validation/test splits (e.g., percentages or sample counts) of these datasets within the paper. The online phase involves collecting new data into a replay buffer. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments, only general mentions of 'Deep Reinforcement Learning' and 'neural networks'. |
| Software Dependencies | No | The paper mentions using algorithms like SAC [Haarnoja et al., 2018], AWAC [Nair et al., 2020], TD3+BC [Fujimoto and Gu, 2021], and D4RL [Fu et al., 2020], but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | All offline algorithms are trained for 1000 epochs of 1000 random mini-batches each. ... We set revitalization interval Tr = 10 and revitalization fitting epochs Nr = 32. ... We run all methods for 300 episodes with 5 random seeds. In each episode, there are 1000 online interaction steps. (A hedged schedule sketch follows this table.) |
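
The Open Datasets row cites the D4RL MuJoCo locomotion tasks. Below is a minimal sketch of how one of the named datasets (e.g. hopper-medium-expert-v2) is typically loaded with the `gym` and `d4rl` packages; the paper does not state package versions, so treat the exact environment IDs and version compatibility as assumptions.

```python
# Hedged sketch: loading a D4RL MuJoCo locomotion dataset named in the paper.
# Assumes the `gym` and `d4rl` packages are installed; versions are not stated in the paper.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("hopper-medium-expert-v2")

# qlearning_dataset returns transition arrays suitable for offline pre-training.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape,
      dataset["terminals"].shape)
```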
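The Experiment Setup row quotes the key training constants. The sketch below only collects those quoted numbers into one schedule to show the offline-pretraining and online-fine-tuning structure they imply; the function names (`offline_update`, `revitalize_policy`, `online_step`, `online_update`) are hypothetical placeholders, not the authors' API, and the exact use of Tr and Nr is an assumption based on the quoted text.

```python
# Hedged sketch of the reported experiment schedule. Constants come from the
# setup text quoted above; all method names are hypothetical placeholders.
OFFLINE_EPOCHS = 1000          # offline pre-training epochs
MINIBATCHES_PER_EPOCH = 1000   # random mini-batches per offline epoch
REVITALIZATION_INTERVAL = 10   # Tr: episodes between policy revitalizations (assumed usage)
REVITALIZATION_EPOCHS = 32     # Nr: fitting epochs after each revitalization (assumed usage)
ONLINE_EPISODES = 300          # online fine-tuning episodes per run
STEPS_PER_EPISODE = 1000       # online interaction steps per episode
NUM_SEEDS = 5                  # independent random seeds per task

def run_experiment(agent, env, dataset, replay_buffer):
    # Offline pre-training on the fixed D4RL dataset.
    for _ in range(OFFLINE_EPOCHS):
        for _ in range(MINIBATCHES_PER_EPOCH):
            agent.offline_update(dataset.sample_minibatch())

    # Online fine-tuning with a periodic revitalization step.
    for episode in range(ONLINE_EPISODES):
        if episode % REVITALIZATION_INTERVAL == 0:
            # Hypothetical reading of Tr/Nr: every Tr episodes, refresh the
            # policy and fit it for Nr epochs before continuing online training.
            agent.revitalize_policy(fitting_epochs=REVITALIZATION_EPOCHS)
        for _ in range(STEPS_PER_EPISODE):
            transition = agent.online_step(env)
            replay_buffer.add(transition)
            agent.online_update(replay_buffer.sample_minibatch())
```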