SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning

Authors: Jiaheng Feng, Mingxiao Feng, Haolin Song, Wengang Zhou, Houqiang Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments to answer the following questions: (1) Can SUF stabilize unconstrained fine-tuning by eliminating policy collapse? (2) Can SUF outperform SOTA baselines when combined with diverse offline RL backbones, including IQL, TD3-BC, and CQL? (3) What are the contributions of each component in SUF? (4) What are the impacts of different hyperparameters on SUF?
Researcher Affiliation | Academia | EEIS Department, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. {fengjiaheng, fmxustc, hlsong}@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn
Pseudocode | Yes | Algorithm 1: SUF pseudo-code
Open Source Code | No | The paper mentions that PEX and PROTO are implemented on author-provided codes, but it does not state that the authors of this paper are releasing their own source code for SUF.
Open Datasets | Yes | We consider all MuJoCo (Todorov, Erez, and Tassa 2012) environments from the public D4RL (Fu et al. 2020) benchmark: Halfcheetah, Hopper, Walker2d, and Ant.
Dataset Splits | No | The paper mentions the use of D4RL datasets (random, medium, medium-replay) and the number of pre-training and fine-tuning steps, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | No | "It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC." This mentions a GPU cluster but lacks specific GPU models or other detailed hardware specifications.
Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | In this work, we consistently set Gc = 20 and Gc = 1/4 across diverse backbones, environments, and datasets throughout fine-tuning for simplicity. For IQL-based methods, we perform 1 million update steps for offline pre-training and then 0.3 million environment steps for online fine-tuning. For SUF-TD3-BC and SUF-CQL, we perform 1 million pre-training steps and 0.1 million fine-tuning steps.
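The quoted step budgets can be collected into a small configuration sketch. This is purely illustrative: the function and variable names below are hypothetical and not taken from the SUF codebase; only the numeric values (1M offline updates for all backbones, 0.3M online environment steps for SUF-IQL, 0.1M for SUF-TD3-BC and SUF-CQL) come from the paper's stated setup.

```python
# Hypothetical sketch of the training schedule described in the paper.
# Names (schedule, PRETRAIN_STEPS, FINETUNE_STEPS) are illustrative only.

PRETRAIN_STEPS = 1_000_000  # offline pre-training updates, same for all backbones

FINETUNE_STEPS = {          # online fine-tuning environment steps per backbone
    "SUF-IQL": 300_000,
    "SUF-TD3-BC": 100_000,
    "SUF-CQL": 100_000,
}

def schedule(backbone: str) -> dict:
    """Return the offline/online step budget quoted for a given backbone."""
    return {
        "offline_updates": PRETRAIN_STEPS,
        "online_env_steps": FINETUNE_STEPS[backbone],
    }
```

For example, `schedule("SUF-IQL")` yields a budget of 1,000,000 offline updates followed by 300,000 online environment steps.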