SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning
Authors: Jiaheng Feng, Mingxiao Feng, Haolin Song, Wengang Zhou, Houqiang Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to answer the following questions: (1) Can SUF stabilize unconstrained fine-tuning by eliminating policy collapse? (2) Can SUF outperform SOTA baselines when combined with diverse offline RL backbones, including IQL, TD3-BC, and CQL? (3) What are the contributions of each component in SUF? (4) What are the impacts of different hyperparameters on SUF? |
| Researcher Affiliation | Academia | EEIS Department, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center {fengjiaheng, fmxustc, hlsong}@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn |
| Pseudocode | Yes | Algorithm 1: SUF pseudo-code |
| Open Source Code | No | The paper notes that PEX and PROTO are implemented using their original authors' code, but it does not state that the authors of this paper release their own source code for SUF. |
| Open Datasets | Yes | We consider all MuJoCo (Todorov, Erez, and Tassa 2012) environments from the public D4RL (Fu et al. 2020) benchmark: Halfcheetah, Hopper, Walker2d, and Ant. |
| Dataset Splits | No | The paper mentions the use of D4RL datasets (random, medium, medium-replay) and the number of pre-training and fine-tuning steps, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | No | It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC. This mentions a 'GPU cluster' but lacks specific GPU models or other detailed hardware specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In this work, we consistently set Gc = 20 and Ga = 1/4 across diverse backbones, environments, and datasets throughout fine-tuning for simplicity. For IQL-based methods, we perform 1 million update steps for offline pre-training and then 0.3 million environment steps for online fine-tuning. For SUF-TD3-BC and SUF-CQL, we perform 1 million pre-training steps and 0.1 million fine-tuning steps. |
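
Taken together, the dataset and setup rows describe a standard offline-to-online pipeline: pre-train on a D4RL MuJoCo dataset for 1 million gradient steps, then fine-tune online for 0.3 million (or 0.1 million) environment steps with a high update ratio of 20 and a second ratio of 1/4. The sketch below illustrates that schedule only; the `offline_to_online` function, the `agent`/`replay_buffer` interfaces, and the reading of the two ratios as critic- and actor-update frequencies are assumptions, not a reproduction of the paper's Algorithm 1. The `gym`/`d4rl` dataset-loading calls are the public D4RL API.

```python
# Hedged sketch of the quoted experiment schedule. Assumptions: the agent and
# replay_buffer interfaces (update / update_critic / update_actor / act / add /
# sample) are hypothetical stand-ins for an IQL, TD3-BC, or CQL backbone, and
# the 20 and 1/4 settings are interpreted as critic and actor update ratios.
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the D4RL MuJoCo tasks with gym


def offline_to_online(agent, replay_buffer, env,
                      pretrain_steps=1_000_000,    # offline gradient steps (from the paper's setup)
                      finetune_env_steps=300_000,  # online env steps (0.1M for TD3-BC/CQL backbones)
                      critic_utd=20,               # assumed: critic updates per environment step
                      actor_ratio=0.25):           # assumed: actor updates per critic update
    # Offline pre-training: gradient updates on the static D4RL dataset.
    for _ in range(pretrain_steps):
        agent.update(replay_buffer.sample())

    # Online fine-tuning: frequent critic updates, less frequent actor updates.
    obs, critic_updates = env.reset(), 0
    for _ in range(finetune_env_steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        for _ in range(critic_utd):
            agent.update_critic(replay_buffer.sample())
            critic_updates += 1
            if critic_updates % round(1 / actor_ratio) == 0:
                agent.update_actor(replay_buffer.sample())
    return agent


if __name__ == "__main__":
    # Dataset loading via the public gym/D4RL API; halfcheetah-medium-v2 is one
    # dataset consistent with the environments and dataset types named above.
    env = gym.make("halfcheetah-medium-v2")
    dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...
    # agent = ...; replay_buffer = ...  # supply a backbone and a buffer seeded with `dataset`
```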