SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning

Authors: Jiaheng Feng, Mingxiao Feng, Haolin Song, Wengang Zhou, Houqiang Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments to answer the following questions: (1) Can SUF stabilize unconstrained fine-tuning by eliminating policy collapse? (2) Can SUF outperform SOTA baselines when combined with diverse offline RL backbones, including IQL, TD3-BC, and CQL? (3) What are the contributions of each component in SUF? (4) What are the impacts of different hyperparameters on SUF?
Researcher Affiliation | Academia | EEIS Department, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. {fengjiaheng, fmxustc, hlsong}@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn
Pseudocode | Yes | Algorithm 1: SUF pseudo-code
Open Source Code | No | The paper mentions that PEX and PROTO are implemented on author-provided codes, but it does not state that the authors of this paper are releasing their own source code for SUF.
Open Datasets | Yes | We consider all MuJoCo (Todorov, Erez, and Tassa 2012) environments from the public D4RL (Fu et al. 2020) benchmark: Halfcheetah, Hopper, Walker2d, and Ant.
Dataset Splits | No | The paper mentions the use of D4RL datasets (random, medium, medium-replay) and the number of pre-training and fine-tuning steps, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | No | "It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC." This mentions a GPU cluster but lacks specific GPU models or other detailed hardware specifications.
Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | In this work, we consistently set Gc = 20 and Gc = 1/4 across diverse backbones, environments, and datasets throughout fine-tuning for simplicity. For IQL-based methods, we perform 1 million update steps for offline pre-training and then 0.3 million environment steps for online fine-tuning. For SUF-TD3-BC and SUF-CQL, we perform 1 million pre-training steps and 0.1 million fine-tuning steps.
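The quoted step budgets can be collected into a small configuration sketch. This is purely illustrative: the function and variable names below are hypothetical and not taken from the SUF codebase; only the numeric values (1M offline updates for all backbones, 0.3M online environment steps for SUF-IQL, 0.1M for SUF-TD3-BC and SUF-CQL) come from the paper's stated setup.

```python
# Hypothetical sketch of the training schedule described in the paper.
# Names (schedule, PRETRAIN_STEPS, FINETUNE_STEPS) are illustrative only.

PRETRAIN_STEPS = 1_000_000  # offline pre-training updates, same for all backbones

FINETUNE_STEPS = {          # online fine-tuning environment steps per backbone
    "SUF-IQL": 300_000,
    "SUF-TD3-BC": 100_000,
    "SUF-CQL": 100_000,
}

def schedule(backbone: str) -> dict:
    """Return the offline/online step budget quoted for a given backbone."""
    return {
        "offline_updates": PRETRAIN_STEPS,
        "online_env_steps": FINETUNE_STEPS[backbone],
    }
```

For example, `schedule("SUF-IQL")` yields a budget of 1,000,000 offline updates followed by 300,000 online environment steps.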