Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL

Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that the proposed method can achieve stable and efficient performance improvement on multiple simulated tasks when compared to the state-of-the-art methods. The implementation is available at https://github.com/QinwenLuo/OCR-CFT.
Researcher Affiliation | Collaboration | Qin-Wen Luo (1), Ming-Kun Xie (1,2), Ye-Wen Wang (1), Sheng-Jun Huang (1); (1) Nanjing University of Aeronautics and Astronautics, Nanjing, China; (2) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Pseudocode | Yes | Algorithm 1 (O2SAC), Algorithm 2 (O2TD3), Algorithm 3 (O2PPO)
Open Source Code | Yes | The implementation is available at https://github.com/QinwenLuo/OCR-CFT.
Open Datasets | Yes | We perform experiments to validate the effectiveness of the proposed method on D4RL [9] MuJoCo and AntMaze tasks, including HalfCheetah, Hopper, Walker2d and AntMaze environments. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper does not explicitly state the training/validation/test dataset splits needed for reproduction beyond referencing standard D4RL tasks.
Hardware Specification | No | The paper mentions running an experiment on an "Nvidia 3070 GPU" for the computational cost analysis but does not specify the hardware used for the main experimental results across all tasks. It states "we experiment our methods in many devices with different GPUs" in the NeurIPS checklist justification.
Software Dependencies | No | The paper mentions using PyTorch (Appendix H.1) and the CORL library but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For medium and medium-replay datasets, we set τ as a linearly increasing variable from 0.125 to 2.0... And for medium-expert and expert datasets, we set it from 0.005 to 0.125 for safe update... We set the maximum value of the standard deviations of policies trained in the two datasets as 0.05... (A hedged schedule sketch follows the table.)
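
The D4RL MuJoCo and AntMaze datasets cited in the Open Datasets row are publicly available through the standard `d4rl` package. A minimal loading sketch is shown below; the dataset id `halfcheetah-medium-v2` is an illustrative assumption and not a claim about the exact dataset versions used in the paper.

```python
# Minimal sketch: loading a D4RL dataset for offline pre-training.
# The dataset id "halfcheetah-medium-v2" is an illustrative assumption;
# the paper evaluates on D4RL MuJoCo and AntMaze tasks generally.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("halfcheetah-medium-v2")

# qlearning_dataset returns transition arrays:
# observations, actions, next_observations, rewards, terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```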
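
The Experiment Setup row describes τ as a coefficient that increases linearly over fine-tuning (0.125 to 2.0 for medium and medium-replay, 0.005 to 0.125 for medium-expert and expert). The sketch below only mirrors those reported ranges with a generic linear schedule; the function and variable names are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a linearly increasing coefficient schedule, assuming tau
# is interpolated from a start to an end value over the online fine-tuning
# steps. Ranges follow the quoted setup; all names here are hypothetical.
def linear_tau(step: int, total_steps: int, tau_start: float, tau_end: float) -> float:
    """Linearly interpolate tau from tau_start to tau_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)

# Reported ranges per dataset type.
TAU_RANGES = {
    "medium": (0.125, 2.0),
    "medium-replay": (0.125, 2.0),
    "medium-expert": (0.005, 0.125),
    "expert": (0.005, 0.125),
}

tau_start, tau_end = TAU_RANGES["medium"]
tau = linear_tau(step=50_000, total_steps=100_000, tau_start=tau_start, tau_end=tau_end)
print(tau)  # 1.0625 at the halfway point of fine-tuning
```

The quoted cap of 0.05 on the policy standard deviation for medium-expert and expert datasets would be applied separately, e.g. by clipping the policy's predicted standard deviation during fine-tuning.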