Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL

Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that the proposed method can achieve stable and efficient performance improvement on multiple simulated tasks when compared to the state-of-the-art methods. The implementation is available at https://github.com/QinwenLuo/OCR-CFT.
Researcher Affiliation | Collaboration | Qin-Wen Luo (1), Ming-Kun Xie (1,2), Ye-Wen Wang (1), Sheng-Jun Huang (1); (1) Nanjing University of Aeronautics and Astronautics, Nanjing, China; (2) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Pseudocode | Yes | Algorithm 1 (O2SAC), Algorithm 2 (O2TD3), Algorithm 3 (O2PPO)
Open Source Code | Yes | The implementation is available at https://github.com/QinwenLuo/OCR-CFT.
Open Datasets | Yes | We perform experiments to validate the effectiveness of the proposed method on D4RL [9] MuJoCo and AntMaze tasks, including HalfCheetah, Hopper, Walker2d and AntMaze environments. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper does not explicitly state the training/validation/test dataset splits needed for reproduction beyond referencing standard D4RL tasks.
Hardware Specification | No | The paper mentions running an experiment on an "Nvidia 3070 GPU" for the computational cost analysis but does not specify the hardware used for the main experimental results across all tasks. It states "we experiment our methods in many devices with different GPUs" in the NeurIPS checklist justification.
Software Dependencies | No | The paper mentions using PyTorch (Appendix H.1) and the CORL library but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For medium and medium-replay datasets, we set τ as a linearly increasing variable from 0.125 to 2.0... And for medium-expert and expert datasets, we set it from 0.005 to 0.125 for safe update... We set the maximum value of the standard deviations of policies trained in the two datasets as 0.05... (A hedged schedule sketch follows the table.)
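
The D4RL MuJoCo and AntMaze datasets cited in the Open Datasets row are publicly available through the standard `d4rl` package. A minimal loading sketch is shown below; the dataset id `halfcheetah-medium-v2` is an illustrative assumption and not a claim about the exact dataset versions used in the paper.

```python
# Minimal sketch: loading a D4RL dataset for offline pre-training.
# The dataset id "halfcheetah-medium-v2" is an illustrative assumption;
# the paper evaluates on D4RL MuJoCo and AntMaze tasks generally.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("halfcheetah-medium-v2")

# qlearning_dataset returns transition arrays:
# observations, actions, next_observations, rewards, terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```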
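
The Experiment Setup row describes τ as a coefficient that increases linearly over fine-tuning (0.125 to 2.0 for medium and medium-replay, 0.005 to 0.125 for medium-expert and expert). The sketch below only mirrors those reported ranges with a generic linear schedule; the function and variable names are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a linearly increasing coefficient schedule, assuming tau
# is interpolated from a start to an end value over the online fine-tuning
# steps. Ranges follow the quoted setup; all names here are hypothetical.
def linear_tau(step: int, total_steps: int, tau_start: float, tau_end: float) -> float:
    """Linearly interpolate tau from tau_start to tau_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)

# Reported ranges per dataset type.
TAU_RANGES = {
    "medium": (0.125, 2.0),
    "medium-replay": (0.125, 2.0),
    "medium-expert": (0.005, 0.125),
    "expert": (0.005, 0.125),
}

tau_start, tau_end = TAU_RANGES["medium"]
tau = linear_tau(step=50_000, total_steps=100_000, tau_start=tau_start, tau_end=tau_end)
print(tau)  # 1.0625 at the halfway point of fine-tuning
```

The quoted cap of 0.05 on the policy standard deviation for medium-expert and expert datasets would be applied separately, e.g. by clipping the policy's predicted standard deviation during fine-tuning.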