Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL
Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that the proposed method can achieve stable and efficient performance improvement on multiple simulated tasks when compared to the state-of-the-art methods. The implementation is available at https://github.com/QinwenLuo/OCR-CFT. |
| Researcher Affiliation | Collaboration | Qin-Wen Luo¹, Ming-Kun Xie¹,², Ye-Wen Wang¹, Sheng-Jun Huang¹; ¹Nanjing University of Aeronautics and Astronautics, Nanjing, China; ²RIKEN Center for Advanced Intelligence Project, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1: O2SAC; Algorithm 2: O2TD3; Algorithm 3: O2PPO |
| Open Source Code | Yes | The implementation is available at https://github.com/QinwenLuo/OCR-CFT. |
| Open Datasets | Yes | We perform experiments to validate the effectiveness of the proposed method on D4RL [9] MuJoCo and AntMaze tasks, including HalfCheetah, Hopper, Walker2d and AntMaze environments. (A minimal dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly state the training/validation/test dataset splits needed for reproduction beyond referencing standard D4RL tasks. |
| Hardware Specification | No | The paper mentions running an experiment on an "Nvidia 3070 GPU" for computational cost analysis but does not specify the hardware used for the main experimental results across all tasks. It states "we experiment our methods in many devices with different GPUs" in the NeurIPS checklist justification. |
| Software Dependencies | No | The paper mentions using PyTorch (Appendix H.1) and the CORL library, but does not provide version numbers for either software component. |
| Experiment Setup | Yes | For medium and medium-replay datasets, we set τ as a linearly increasing variable from 0.125 to 2.0... And for medium-expert and expert datasets, we set it from 0.005 to 0.125 for safe update... We set the maximum value of the standard deviations of policies trained in the two datasets as 0.05... (A minimal sketch of this linear τ schedule follows the table.) |
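The Open Datasets row confirms that the experiments use the public D4RL benchmark. As a minimal sketch of how such datasets are typically obtained, the snippet below loads one D4RL MuJoCo dataset via the standard `d4rl.qlearning_dataset` helper; the exact dataset version suffix (`-v2`) is an assumption, since the quoted text names the environments but not the dataset versions used in the paper.

```python
import gym
import d4rl  # importing d4rl registers the D4RL datasets with gym

# "halfcheetah-medium-v2" is an illustrative choice; the paper also uses
# Hopper, Walker2d, and AntMaze variants, whose version suffixes are not
# quoted in this report.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)

# The returned dict contains observations, actions, rewards,
# next_observations, and terminals as NumPy arrays.
print(dataset["observations"].shape, dataset["actions"].shape)
```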
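The Experiment Setup row quotes dataset-dependent ranges for a linearly increasing τ. The sketch below shows one way to implement such a schedule; only the numeric ranges come from the quoted text, while the function name, the total-step budget, and how τ enters the training objective are assumptions not specified in this excerpt.

```python
def linear_tau(step: int, total_steps: int, tau_start: float, tau_end: float) -> float:
    """Linearly interpolate tau from tau_start to tau_end over the fine-tuning run."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)


# Ranges quoted in the Experiment Setup row; keys follow D4RL dataset naming.
TAU_RANGES = {
    "medium": (0.125, 2.0),
    "medium-replay": (0.125, 2.0),
    "medium-expert": (0.005, 0.125),
    "expert": (0.005, 0.125),
}

# total_steps is a placeholder fine-tuning budget, not taken from the paper.
total_steps = 1_000_000
tau_start, tau_end = TAU_RANGES["medium"]
tau = linear_tau(step=250_000, total_steps=total_steps,
                 tau_start=tau_start, tau_end=tau_end)
print(f"tau at 25% of training: {tau:.4f}")
```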