Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

FOSP: Fine-tuning Offline Safe Policy through World Models

Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our method, we design experiments across numerous Safety-Gymnasium tasks and real-world robotic arms including offline training and online fine-tuning. For simulation tasks, experimental results show robust performance in both offline and online fine-tuning phases and achieve nearly zero cost, outperforming prior RL algorithms. Furthermore, we deploy FOSP in real-world experiments, utilizing a Franka manipulator to perform trajectory planning tasks.
Researcher Affiliation | Academia | Chenyang Cao1, Yucheng Xin1, Silang Wu1, Longxiang He1, Zichen Yan2, Junbo Tan1, Xueqian Wang1; 1 Shenzhen International Graduate School, Tsinghua University; 2 College of Design and Engineering, National University of Singapore
Pseudocode | Yes | C PSEUDO CODE. Algorithm 1: FOSP: Fine-tuning Offline Safe Policy through World Models
Open Source Code | No | The text does not contain a specific link to a code repository or an explicit statement about releasing the source code for the methodology described.
Open Datasets | Yes | We consider five tasks on Safety-Gymnasium benchmark (Ji et al., 2023) environments.
Dataset Splits | No | The paper describes the composition of the offline dataset (e.g., 'mixed in a 1:1:1 ratio, with each part containing 200 trajectories') and the offline-to-online training procedure (e.g., '1M steps for offline and 0.5M steps for online fine-tuning'), but it does not specify explicit train/validation/test splits of the collected dataset for model evaluation.
Hardware Specification | Yes | The hardware used comprised four GeForce RTX 4090 GPUs and an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz.
Software Dependencies | Yes | The experiments for FOSP were conducted in a Python 3.10 environment with JAX 0.4.26. Our setup included CUDA version 12.1, running on Ubuntu 20.04.
Experiment Setup | Yes | Table 4. Hyperparameters for FOSP: Module Name, Symbol, Value (e.g., batch size B = 64, learning rate l_wm = 10^-4, discount horizon γ = 0.997, AWR temperature β1, β2 = 10, actor entropy regularizer η = 3 × 10^-4).
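
For readers who want the reported setup in machine-readable form, the values quoted in the table above can be collected into a small configuration object. This is only a sketch: the class and field names are hypothetical and do not come from the authors' code. Reading the learning-rate symbol as the world-model learning rate, and applying the value 10 to both AWR temperatures, are assumptions. The derived quantities follow by arithmetic from the quoted numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FOSPSetup:
    """Reported FOSP settings. Field names are hypothetical; values are quoted above."""

    batch_size: int = 64
    # Assumption: the learning-rate symbol refers to the world model.
    world_model_lr: float = 1e-4
    discount: float = 0.997
    # Assumption: the single value 10 applies to both beta_1 and beta_2.
    awr_temperature: float = 10.0
    actor_entropy_coef: float = 3e-4
    offline_steps: int = 1_000_000   # "1M steps for offline"
    online_steps: int = 500_000      # "0.5M steps for online fine-tuning"
    dataset_parts: int = 3           # "mixed in a 1:1:1 ratio"
    trajectories_per_part: int = 200

    @property
    def total_trajectories(self) -> int:
        # Size of the offline dataset implied by the 1:1:1 mix.
        return self.dataset_parts * self.trajectories_per_part

    @property
    def total_training_steps(self) -> int:
        # Offline training plus online fine-tuning.
        return self.offline_steps + self.online_steps

    @property
    def effective_horizon(self) -> float:
        # Standard 1 / (1 - gamma) rule of thumb for a discount factor.
        return 1.0 / (1.0 - self.discount)


cfg = FOSPSetup()
print(cfg.total_trajectories, cfg.total_training_steps, round(cfg.effective_horizon, 1))
```

Under these assumptions the offline dataset holds 600 trajectories, training runs for 1.5M steps in total, and a discount of 0.997 corresponds to an effective horizon of roughly 333 steps.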