Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

FOSP: Fine-tuning Offline Safe Policy through World Models

Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | To evaluate our method, we design experiments across numerous Safety-Gymnasium tasks and real-world robotic arms, including offline training and online fine-tuning. For simulation tasks, experimental results show robust performance in both offline and online fine-tuning phases and achieve nearly zero cost, outperforming prior RL algorithms. Furthermore, we deploy FOSP in real-world experiments, utilizing a Franka manipulator to perform trajectory planning tasks. |
| Researcher Affiliation | Academia | Chenyang Cao1, Yucheng Xin1, Silang Wu1, Longxiang He1, Zichen Yan2, Junbo Tan1, Xueqian Wang1. 1 Shenzhen International Graduate School, Tsinghua University; 2 College of Design and Engineering, National University of Singapore |
| Pseudocode | Yes | Appendix C (Pseudo Code), Algorithm 1: "FOSP: Fine-tuning Offline Safe Policy through World Models" |
| Open Source Code | No | The text does not contain a specific link to a code repository or an explicit statement about releasing the source code for the methodology described. |
| Open Datasets | Yes | We consider five tasks on Safety-Gymnasium benchmark (Ji et al., 2023) environments. |
| Dataset Splits | No | The paper describes the composition of the offline dataset (e.g., 'mixed in a 1:1:1 ratio, with each part containing 200 trajectories') and the offline-to-online training procedure (e.g., '1M steps for offline and 0.5M steps for online fine-tuning'), but it does not specify explicit train/validation/test splits of the collected dataset for model evaluation. |
| Hardware Specification | Yes | The hardware used comprised four GeForce RTX 4090 GPUs and an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz. |
| Software Dependencies | Yes | The experiments for FOSP were conducted in a Python 3.10 environment with JAX 0.4.26. Our setup included CUDA version 12.1, running on Ubuntu 20.04. |
| Experiment Setup | Yes | Table 4 lists the hyperparameters for FOSP (Module Name, Symbol, Value), e.g., batch size B = 64, world-model learning rate l_wm = 10^-4, discount horizon γ = 0.997, AWR temperatures β1, β2 = 10, actor entropy regularizer η = 3×10^-4. |
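For quick reference, the Table 4 values quoted above can be collected into a small configuration sketch. The key names below are illustrative choices, not identifiers from the paper; only the numeric values are taken from the report.

```python
# Hypothetical FOSP hyperparameter config assembled from the Table 4
# values quoted above. Key names are illustrative, not the paper's own.
FOSP_HYPERPARAMS = {
    "batch_size": 64,            # B
    "wm_learning_rate": 1e-4,    # l_wm, world-model learning rate
    "discount": 0.997,           # discount horizon (gamma)
    "awr_temperature": 10.0,     # beta_1, beta_2 (AWR temperatures)
    "actor_entropy_reg": 3e-4,   # eta, actor entropy regularizer
}

if __name__ == "__main__":
    # Print the config as a quick sanity check.
    for name, value in FOSP_HYPERPARAMS.items():
        print(f"{name}: {value}")
```

This is only a reference listing of the reported values, not a claim about how the authors' codebase structures its configuration.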