Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
FOSP: Fine-tuning Offline Safe Policy through World Models
Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our method, we design experiments across numerous Safety-Gymnasium tasks and real-world robotic arms, including offline training and online fine-tuning. For simulation tasks, experimental results show robust performance in both offline and online fine-tuning phases and achieve nearly zero cost, outperforming prior RL algorithms. Furthermore, we deploy FOSP in real-world experiments, utilizing a Franka manipulator to perform trajectory planning tasks. |
| Researcher Affiliation | Academia | Chenyang Cao1, Yucheng Xin1, Silang Wu1, Longxiang He1, Zichen Yan2, Junbo Tan1, Xueqian Wang1; 1 Shenzhen International Graduate School, Tsinghua University; 2 College of Design and Engineering, National University of Singapore |
| Pseudocode | Yes | C PSEUDO CODE Algorithm 1 FOSP: Fine-tuning Offline Safe Policy through World Models |
| Open Source Code | No | The text does not contain a specific link to a code repository or an explicit statement about releasing the source code for the methodology described. |
| Open Datasets | Yes | We consider five tasks on Safety-Gymnasium benchmark (Ji et al., 2023) environments. |
| Dataset Splits | No | The paper describes the composition of the offline dataset (e.g., 'mixed in a 1:1:1 ratio, with each part containing 200 trajectories') and the offline-to-online training procedure (e.g., '1M steps for offline and 0.5M steps for online fine-tuning'), but it does not specify explicit train/validation/test splits of the collected dataset for model evaluation. |
| Hardware Specification | Yes | The hardware used comprised four GeForce RTX 4090 GPUs and an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz. |
| Software Dependencies | Yes | The experiments for FOSP were conducted in a Python 3.10 environment with JAX 0.4.26. Our setup included CUDA version 12.1, running on Ubuntu 20.04. |
| Experiment Setup | Yes | Table 4. Hyperparameters for FOSP: Module Name, Symbol, Value (e.g., Batch size B = 64, Learning rate l_wm = 10^-4, Discount horizon γ = 0.997, AWR temperature β1, β2 = 10, Actor entropy regularizer η = 3×10^-4). |