Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FOSP: Fine-tuning Offline Safe Policy through World Models
Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our method, we design experiments across numerous Safety-Gymnasium tasks and real-world robotic arms including offline training and online fine-tuning. For simulation tasks, experimental results show robust performance in both offline and online fine-tuning phases and achieve nearly zero cost, outperforming prior RL algorithms. Furthermore, we deploy FOSP in real-world experiments, utilizing a Franka manipulator to perform trajectory planning tasks. |
| Researcher Affiliation | Academia | Chenyang Cao¹, Yucheng Xin¹, Silang Wu¹, Longxiang He¹, Zichen Yan², Junbo Tan¹, Xueqian Wang¹. ¹ Shenzhen International Graduate School, Tsinghua University; ² College of Design and Engineering, National University of Singapore |
| Pseudocode | Yes | C PSEUDO CODE Algorithm 1 FOSP: Fine-tuning Offline Safe Policy through World Models |
| Open Source Code | No | The text does not contain a specific link to a code repository or an explicit statement about releasing the source code for the methodology described. |
| Open Datasets | Yes | We consider five tasks on Safety-Gymnasium benchmark (Ji et al., 2023) environments. |
| Dataset Splits | No | The paper describes the composition of the offline dataset (e.g., 'mixed in a 1:1:1 ratio, with each part containing 200 trajectories') and the offline-to-online training procedure (e.g., '1M steps for offline and 0.5M steps for online fine-tuning'), but it does not specify explicit train/validation/test splits of the collected dataset for model evaluation. |
| Hardware Specification | Yes | The hardware used comprised four GeForce RTX 4090 GPUs and an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz. |
| Software Dependencies | Yes | The experiments for FOSP were conducted in a Python 3.10 environment with JAX 0.4.26. Our setup included CUDA version 12.1, running on Ubuntu 20.04. |
| Experiment Setup | Yes | Table 4. Hyperparameters for FOSP: Module Name, Symbol, Value (e.g., batch size B = 64, learning rate l_wm = 10^-4, discount horizon γ = 0.997, AWR temperature β1, β2 = 10, actor entropy regularizer η = 3 × 10^-4). |
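
For readers attempting a reproduction, the values reported in the table above can be collected in one place. The following is a minimal sketch, not the authors' code (the Open Source Code row records no released repository). All class, field, and function names are hypothetical. The values are transcribed from the Experiment Setup, Dataset Splits, and Software Dependencies rows. It assumes that both AWR temperatures share the value 10 and that the actor entropy coefficient is 3 × 10^-4.

```python
from dataclasses import dataclass, asdict
import sys


@dataclass(frozen=True)
class FOSPHyperparameters:
    """Values transcribed from the summary table; field names are hypothetical."""
    batch_size: int = 64               # B
    world_model_lr: float = 1e-4       # l_wm
    discount: float = 0.997            # gamma ("discount horizon")
    awr_temperature_1: float = 10.0    # beta_1 (assumed to share the value 10)
    awr_temperature_2: float = 10.0    # beta_2 (assumed to share the value 10)
    actor_entropy_scale: float = 3e-4  # eta; read as 3 x 10^-4 (an assumption)
    offline_steps: int = 1_000_000     # "1M steps for offline"
    online_steps: int = 500_000        # "0.5M steps for online fine-tuning"


# Software stack as reported in the Software Dependencies row.
REPORTED_STACK = {"python": "3.10", "jax": "0.4.26", "cuda": "12.1", "ubuntu": "20.04"}


def python_matches_reported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) interpreter version equals the reported one."""
    major, minor = REPORTED_STACK["python"].split(".")
    return (version_info[0], version_info[1]) == (int(major), int(minor))


if __name__ == "__main__":
    print(asdict(FOSPHyperparameters()))
    print("Python matches reported version:", python_matches_reported())
```

The dataclass is frozen so that a run cannot silently mutate the recorded settings. The version check covers only the Python interpreter, so the sketch runs without JAX or CUDA installed.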