ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
Authors: Kai Zhao, Jianye Hao, Yi Ma, Jinyi Liu, Yan Zheng, Zhaopeng Meng
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods. |
| Researcher Affiliation | Collaboration | Kai Zhao (1,2), Jianye Hao (1), Yi Ma (1), Jinyi Liu (1), Yan Zheng (1), and Zhaopeng Meng (1); (1) College of Intelligence and Computing, Tianjin University; (2) Bilibili |
| Pseudocode | Yes | Algorithm 1 summarizes the offline and online procedures of ENOTO; see the sketch after this table for the general Q-ensemble idea. |
| Open Source Code | No | The paper mentions relying on 'publicly available and widely accepted code repositories from GitHub [Seno and Imai, 2022; Tarasov et al., 2022]' for baselines, but it does not state that the authors' own code for the ENOTO methodology is open-source or provide a link to it. |
| Open Datasets | Yes | We first evaluate our ENOTO framework on MuJoCo [Todorov et al., 2012] locomotion tasks, i.e., HalfCheetah, Walker2d, and Hopper from the D4RL benchmark suite [Fu et al., 2020] (see the loading example after this table). |
| Dataset Splits | No | The paper mentions using 'medium, medium-replay and medium-expert datasets' and specifies a pre-training duration ('1M training steps') and an online fine-tuning duration ('250K environmental steps'). However, it does not explicitly provide train/validation/test dataset splits (e.g., percentages or counts) in the main text, noting only that 'Additional experimental details can be found in the appendix.' |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'publicly available and widely accepted code repositories from GitHub [Seno and Imai, 2022; Tarasov et al., 2022]' for baselines, but it does not list specific software dependencies or version numbers. |
| Experiment Setup | No | The paper specifies training steps ('1M training steps in the offline phase and perform online fine-tuning for 250K environmental steps') and describes algorithmic components. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer settings) or other detailed configuration settings for the experimental setup in the main text, instead deferring 'Additional experimental details' to the appendix. |
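The table notes that the paper's Algorithm 1 covers both the offline and online phases. As a rough illustration of the Q-ensemble idea named in the title, below is a minimal PyTorch sketch of a bootstrapped target computed from an ensemble of Q-networks: pessimistic (ensemble minimum) during offline training, relaxed (ensemble mean) online. The network architecture, the mean-based relaxation, and all names here are assumptions for illustration, not the authors' actual Algorithm 1.

```python
# Hypothetical sketch of an ensemble Q-target for offline-to-online RL.
# Illustrates the general Q-ensemble idea; NOT the authors' Algorithm 1.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One member of the Q-ensemble: Q(s, a) -> scalar."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def ensemble_target(q_targets, next_state, next_action, reward, done,
                    gamma: float = 0.99, pessimistic: bool = True):
    """Bootstrapped target from an ensemble of target Q-networks.

    pessimistic=True  -> minimum over the ensemble (typical offline)
    pessimistic=False -> mean over the ensemble (a common online relaxation)
    """
    with torch.no_grad():
        qs = torch.stack([q(next_state, next_action) for q in q_targets])  # (N, B, 1)
        q_next = qs.min(dim=0).values if pessimistic else qs.mean(dim=0)
        return reward + gamma * (1.0 - done) * q_next
```

Taking the minimum over the ensemble lower-bounds the value estimate, which guards against overestimation on out-of-distribution actions during offline training; relaxing it online trades that pessimism for faster improvement.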
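For the D4RL locomotion datasets cited in the table, a loading sketch follows, assuming the standard `gym`/`d4rl` API; the '-v2' dataset version suffix is an assumption, as the paper does not specify it.

```python
# Loading the D4RL locomotion datasets named in the paper.
# Requires `gym` and `d4rl` (https://github.com/Farama-Foundation/D4RL).
import gym
import d4rl  # registers the D4RL environments with gym

for name in ["halfcheetah", "hopper", "walker2d"]:
    for quality in ["medium", "medium-replay", "medium-expert"]:
        env = gym.make(f"{name}-{quality}-v2")  # "-v2" suffix assumed
        data = d4rl.qlearning_dataset(env)  # dict: observations, actions, rewards, ...
        print(name, quality, data["observations"].shape)
```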