Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning
Authors: Jiaheng Feng, Mingxiao Feng, Haolin Song, Wengang Zhou, Houqiang Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to answer the following questions: (1) Can SUF stabilize unconstrained fine-tuning by eliminating policy collapse? (2) Can SUF outperform SOTA baselines when combined with diverse offline RL backbones, including IQL, TD3-BC, and CQL? (3) What are the contributions of each component in SUF? (4) What are the impacts of different hyperparameters on SUF? |
| Researcher Affiliation | Academia | EEIS Department, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | Yes | Algorithm 1: SUF pseudo-code |
| Open Source Code | No | The paper mentions that PEX and PROTO are implemented on author-provided codes, but it does not state that the authors of this paper are releasing their own source code for SUF. |
| Open Datasets | Yes | We consider all MuJoCo (Todorov, Erez, and Tassa 2012) environments from the public D4RL (Fu et al. 2020) benchmark: Halfcheetah, Hopper, Walker2d, and Ant. |
| Dataset Splits | No | The paper mentions the use of D4RL datasets (random, medium, medium-replay) and the number of pre-training and fine-tuning steps, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | No | The paper states: "It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC, and the Supercomputing Center of the USTC." This mentions a 'GPU cluster' but gives no specific GPU models or other detailed hardware specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In this work, we consistently set Gc = 20 and Gc = 1/4 across diverse backbones, environments, and datasets throughout fine-tuning for simplicity. For IQL-based methods, we perform 1 million update steps for offline pre-training and then 0.3 million environment steps for online fine-tuning. For SUF-TD3-BC and SUF-CQL, we perform 1 million pre-training steps and 0.1 million fine-tuning steps. |
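
The step budgets quoted in the Experiment Setup row can be written down as a small configuration, sketched here in Python. This is an illustrative sketch only, not the authors' code: the names `FineTuneSchedule`, `SCHEDULES`, and `total_steps` are hypothetical, and the two update-ratio hyperparameters are left out because only the step counts are encoded here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSchedule:
    """Hypothetical container for the step budgets reported in the table."""
    pretrain_updates: int   # offline pre-training update steps
    online_env_steps: int   # online fine-tuning environment steps


# Values taken from the Experiment Setup row; the dictionary keys are
# the backbone names used in that row (assumed labels, not an official API).
SCHEDULES = {
    "IQL": FineTuneSchedule(pretrain_updates=1_000_000, online_env_steps=300_000),
    "TD3-BC": FineTuneSchedule(pretrain_updates=1_000_000, online_env_steps=100_000),
    "CQL": FineTuneSchedule(pretrain_updates=1_000_000, online_env_steps=100_000),
}


def total_steps(backbone: str) -> int:
    """Return pre-training plus fine-tuning steps for a given backbone."""
    schedule = SCHEDULES[backbone]
    return schedule.pretrain_updates + schedule.online_env_steps
```

For example, `total_steps("IQL")` adds the 1 million pre-training updates to the 0.3 million fine-tuning steps, while the TD3-BC and CQL variants use the shorter 0.1 million step fine-tuning budget.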