Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization
Authors: Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency compared to previous approaches. |
| Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; {kongr, wucy, gaocx}@lamda.nju.edu.cn, {zzzhang, lim}@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 CPR |
| Open Source Code | Yes | Our code is available at https://github.com/LAMDARL/CPR. |
| Open Datasets | Yes | We select the popular MuJoCo locomotion tasks from D4RL [Fu et al., 2020] as our benchmark for performance comparison. (A hedged dataset-loading sketch follows this table.) |
| Dataset Splits | No | The paper uses various D4RL datasets (e.g., walker2d-random-v2, hopper-medium-expert-v2) for offline pre-training and online fine-tuning, but does not specify explicit train/validation/test splits (e.g., percentages or sample counts) of these datasets within the paper. The online phase involves collecting new data into a replay buffer. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments, only general mentions of 'Deep Reinforcement Learning' and 'neural networks'. |
| Software Dependencies | No | The paper mentions using algorithms like SAC [Haarnoja et al., 2018], AWAC [Nair et al., 2020], TD3+BC [Fujimoto and Gu, 2021], and D4RL [Fu et al., 2020], but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | All offline algorithms are trained for 1000 epochs of 1000 random mini-batches each. ... We set revitalization interval Tr = 10 and revitalization fitting epochs Nr = 32. ... We run all methods for 300 episodes with 5 random seeds. In each episode, there are 1000 online interaction steps. (A hedged schedule sketch follows this table.) |
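
The Open Datasets row cites the D4RL MuJoCo locomotion tasks. Below is a minimal sketch of how one of the named datasets (e.g. hopper-medium-expert-v2) is typically loaded with the `gym` and `d4rl` packages; the paper does not state package versions, so treat the exact environment IDs and version compatibility as assumptions.

```python
# Hedged sketch: loading a D4RL MuJoCo locomotion dataset named in the paper.
# Assumes the `gym` and `d4rl` packages are installed; versions are not stated in the paper.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("hopper-medium-expert-v2")

# qlearning_dataset returns transition arrays suitable for offline pre-training.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape,
      dataset["terminals"].shape)
```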
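The Experiment Setup row quotes the key training constants. The sketch below only collects those quoted numbers into one schedule to show the offline-pretraining and online-fine-tuning structure they imply; the function names (`offline_update`, `revitalize_policy`, `online_step`, `online_update`) are hypothetical placeholders, not the authors' API, and the exact use of Tr and Nr is an assumption based on the quoted text.

```python
# Hedged sketch of the reported experiment schedule. Constants come from the
# setup text quoted above; all method names are hypothetical placeholders.
OFFLINE_EPOCHS = 1000          # offline pre-training epochs
MINIBATCHES_PER_EPOCH = 1000   # random mini-batches per offline epoch
REVITALIZATION_INTERVAL = 10   # Tr: episodes between policy revitalizations (assumed usage)
REVITALIZATION_EPOCHS = 32     # Nr: fitting epochs after each revitalization (assumed usage)
ONLINE_EPISODES = 300          # online fine-tuning episodes per run
STEPS_PER_EPISODE = 1000       # online interaction steps per episode
NUM_SEEDS = 5                  # independent random seeds per task

def run_experiment(agent, env, dataset, replay_buffer):
    # Offline pre-training on the fixed D4RL dataset.
    for _ in range(OFFLINE_EPOCHS):
        for _ in range(MINIBATCHES_PER_EPOCH):
            agent.offline_update(dataset.sample_minibatch())

    # Online fine-tuning with a periodic revitalization step.
    for episode in range(ONLINE_EPISODES):
        if episode % REVITALIZATION_INTERVAL == 0:
            # Hypothetical reading of Tr/Nr: every Tr episodes, refresh the
            # policy and fit it for Nr epochs before continuing online training.
            agent.revitalize_policy(fitting_epochs=REVITALIZATION_EPOCHS)
        for _ in range(STEPS_PER_EPISODE):
            transition = agent.online_step(env)
            replay_buffer.add(transition)
            agent.online_update(replay_buffer.sample_minibatch())
```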