SafeDreamer: Safe Reinforcement Learning with World Models

Authors: Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, Yaodong Yang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only input, within the Safety-Gymnasium benchmark, showcasing its efficacy in balancing performance and safety in RL tasks. Further details can be found in the code repository: https://github.com/PKU-Alignment/SafeDreamer. and (Section 6, Experimental Results) We use different robotic agents in Safety-Gymnasium (Ji et al., 2023b). The goal in the environments is to navigate robots to predetermined locations while avoiding collisions with other objects. We evaluate algorithms using five fundamental environments (refer to Appendix C.4). Performance is assessed using metrics from Ray et al. (2019).
Researcher Affiliation | Academia | Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, Yaodong Yang; Institute of Artificial Intelligence, Peking University; School of Cyber Science and Technology, Beihang University; yaodong.yang@pku.edu.cn
Pseudocode | Yes | Algorithm 1: Online Safety-Reward Planning. and Algorithm 2: SafeDreamer. and Algorithm 3: PID Lagrangian (Stooke et al., 2020). (Illustrative sketches of a constrained planning loop and a PID Lagrangian update follow the table.)
Open Source Code | Yes | Further details can be found in the code repository: https://github.com/PKU-Alignment/SafeDreamer. and To advance open-source science, we release 80+ model checkpoints along with the code for training and evaluating SafeDreamer agents. These resources are accessible at: https://github.com/PKU-Alignment/SafeDreamer.
Open Datasets | Yes | We use different robotic agents in Safety-Gymnasium (Ji et al., 2023b). and All SafeDreamer agents are trained on one Nvidia 3090Ti GPU each and experiments are conducted utilizing the Safety-Gymnasium, MetaDrive, and Gymnasium benchmarks. and We utilize the Car-Racing environment from OpenAI Gym's gymnasium.envs.box2d interface, as established by Brockman et al. (2016). (A minimal usage sketch of the Safety-Gymnasium interface follows the table.)
Dataset Splits | No | The paper describes using a 'replay buffer of past experiences' for training the world model and performing evaluations over a set number of episodes without network updates. However, it does not explicitly provide training/validation/test dataset splits (e.g., percentages or counts) for its own method. The only mention of splits, 'Validation dataset/Train dataset 10%/90%', applies to the baseline MBPPO-Lag in Table 5 of Appendix B, not to SafeDreamer itself.
Hardware Specification | Yes | The SafeDreamer experiments were executed utilizing Python 3 and Jax 0.3.25, facilitated by CUDA 11.7, on an Ubuntu 20.04.2 LTS system (GNU/Linux 5.8.0-59-generic x86_64) equipped with 40 Intel(R) Xeon(R) Silver 4210R CPU cores (operating at 2.40GHz), 251GB of RAM, and an array of 8 GeForce RTX 3090Ti GPUs.
Software Dependencies | Yes | The SafeDreamer experiments were executed utilizing Python 3 and Jax 0.3.25, facilitated by CUDA 11.7, on an Ubuntu 20.04.2 LTS system (GNU/Linux 5.8.0-59-generic x86_64).
Experiment Setup | Yes | For further hyperparameter configurations, refer to Appendix B. and Table 2: Hyperparameters for SafeDreamer.
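
For orientation, here is a minimal sketch of the kind of constrained planning loop that Algorithm 1 (Online Safety-Reward Planning) describes: candidate action sequences are scored by imagined reward and cost in the learned model, and only candidates within the cost budget are preferred. This is a drastically simplified, single-iteration, random-shooting variant; the paper's procedure iteratively refines a sampling distribution inside the world model's latent space, and the names here (rollout_fn, cost_budget) are illustrative, not taken from the released code.

```python
import numpy as np

def safety_reward_plan(rollout_fn, action_dim, horizon, num_candidates, cost_budget, rng):
    """Toy constrained planner: among sampled action sequences, return the first
    action of the sequence with the highest imagined reward whose imagined cost
    stays within budget; if none is feasible, fall back to the least costly one."""
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    # rollout_fn is assumed to evaluate each candidate in the learned model and
    # return per-candidate imagined return and imagined cumulative cost.
    rewards, costs = rollout_fn(candidates)
    feasible = costs <= cost_budget
    if feasible.any():
        best = np.argmax(np.where(feasible, rewards, -np.inf))
    else:
        best = np.argmin(costs)   # no safe candidate: minimize predicted cost
    return candidates[best, 0]    # execute only the first action (MPC-style)
```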
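
Algorithm 3 references the PID Lagrangian method of Stooke et al. (2020). The following is a minimal sketch of a PID-controlled Lagrange multiplier update written from the cited paper's general recipe, not from the SafeDreamer code base; the gains and variable names are illustrative.

```python
class PIDLagrangian:
    """PID-style update of a non-negative Lagrange multiplier for a cost constraint."""

    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.01):
        self.cost_limit = cost_limit   # d: allowed episodic cost budget
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0            # accumulated constraint violation
        self.prev_cost = 0.0           # previous episodic cost, for the derivative term

    def update(self, episode_cost):
        """Return the multiplier weighting the cost objective against the reward."""
        error = episode_cost - self.cost_limit
        self.integral = max(0.0, self.integral + error)
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        # PID output, clipped at zero so the penalty never becomes a bonus.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)
```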
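
Since all experiments run on Safety-Gymnasium, a short usage sketch may help readers attempting reproduction. It assumes the safety_gymnasium package with its Gymnasium-style API, where env.step additionally returns a scalar cost; the environment id is illustrative and exact names may differ between package versions.

```python
import safety_gymnasium

# Goal-navigation task of the kind used in the paper (id is illustrative).
env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)
episode_cost = 0.0
for _ in range(1000):
    action = env.action_space.sample()
    # Safety-Gymnasium returns a cost signal alongside the usual reward.
    obs, reward, cost, terminated, truncated, info = env.step(action)
    episode_cost += cost
    if terminated or truncated:
        obs, info = env.reset()
        episode_cost = 0.0
env.close()
```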