Safe Reinforcement Learning by Imagining the Near Future

Authors: Garrett Thomas, Yuping Luo, Tengyu Ma

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks. In the experimental evaluation, we compare our algorithm to several model-free safe RL algorithms, as well as MBPO, on various continuous control tasks based on the MuJoCo simulator [Todorov et al., 2012].
Researcher Affiliation | Academia | Garrett Thomas (Stanford University, gwthomas@stanford.edu); Yuping Luo (Princeton University, yupingl@cs.princeton.edu); Tengyu Ma (Stanford University, tengyuma@stanford.edu)
Pseudocode | Yes | Algorithm 1: Safe Model-Based Policy Optimization (SMBPO). (A hypothetical sketch of the short-horizon rollout idea appears after this table.)
Open Source Code | Yes | Code is made available at https://github.com/gwthomas/Safe-MBPO.
Open Datasets | Yes | The tasks are described below: Hopper: standard hopper environment from OpenAI Gym... Cheetah-no-flip: the standard half-cheetah environment from OpenAI Gym... Ant, Humanoid: standard ant and humanoid environments from OpenAI Gym... [Todorov et al., 2012] (MuJoCo simulator). (See the environment sketch after this table.)
Dataset Splits | No | The paper describes continuous control tasks within reinforcement learning environments (OpenAI Gym, MuJoCo) where data is generated dynamically through interaction, rather than using predefined training/validation/test splits; no validation splits are mentioned.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | While the paper mentions frameworks such as MuJoCo and OpenAI Gym, and cites PyTorch, it does not provide version numbers for these or other software dependencies.
Experiment Setup | Yes | Additional experimental details, including hyperparameter selection, are given in Appendix A.3. Our algorithm requires very little hyperparameter tuning. We use γ = 0.99 in all experiments. We tried both H = 5 and H = 10 and found that H = 10 works slightly better, so we use H = 10 in all experiments. The scalar α is a hyperparameter of SAC which controls the tradeoff between entropy and reward. We tune α using the procedure suggested by Haarnoja et al. [2018b]. ... for a hyperparameter τ ∈ (0, 1) which is often chosen small, e.g., 0.005. (These values are collected in the configuration sketch at the end of this section.)
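The Pseudocode row above names Algorithm 1 (SMBPO) but does not reproduce it. Below is a minimal, hypothetical Python sketch of the general idea suggested by the paper's title and the quoted details: short H-step "imagined" rollouts under a learned model, with unsafe states treated as terminal and heavily penalized. The dynamics model, policy, safety predicate, and penalty value are all stand-ins, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch only: every component below is a stand-in, not the paper's code.
H = 10              # rollout horizon, as reported in the table above
PENALTY_C = 100.0   # illustrative large penalty for reaching an unsafe state

def fake_dynamics(state, action):
    """Stand-in for a learned dynamics model: returns (next_state, reward)."""
    next_state = state + 0.1 * action
    reward = -float(np.linalg.norm(next_state))
    return next_state, reward

def fake_policy(state):
    """Stand-in for the current policy (e.g. a SAC actor)."""
    return -0.5 * state

def is_unsafe(state):
    """Stand-in safety predicate (e.g. the hopper has fallen over)."""
    return bool(np.abs(state).max() > 10.0)

def imagined_rollout(start_state):
    """Roll the model forward up to H steps, terminating with a penalty if unsafe."""
    transitions = []
    state = start_state
    for _ in range(H):
        action = fake_policy(state)
        next_state, reward = fake_dynamics(state, action)
        done = is_unsafe(next_state)
        if done:
            reward = -PENALTY_C  # large negative reward discourages unsafe states
        transitions.append((state, action, reward, next_state, done))
        if done:
            break
        state = next_state
    return transitions

if __name__ == "__main__":
    rollout = imagined_rollout(np.array([1.0, -1.0]))
    print(f"collected {len(rollout)} imagined transitions")
```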
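The environments in the Open Datasets row are standard OpenAI Gym / MuJoCo tasks (plus a modified cheetah-no-flip variant). The sketch below only illustrates how such environments are typically instantiated; the exact environment IDs, versions, and Gym API used by the paper are assumptions here, and an older Gym interface (reset returns the observation, step returns four values) is assumed.

```python
import gym  # requires a MuJoCo-enabled Gym installation

# Hypothetical environment IDs; the paper modifies the standard tasks
# (e.g. the custom cheetah-no-flip safety condition), so these names are illustrative.
ENV_IDS = ["Hopper-v2", "HalfCheetah-v2", "Ant-v2", "Humanoid-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```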
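For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. Only the values (γ = 0.99, H = 10, SAC-style α tuning, a small target-update coefficient of 0.005) come from the excerpt; the dictionary keys are illustrative and not taken from the released code.

```python
# Hypothetical configuration sketch assembling the hyperparameters quoted above.
# Key names are illustrative; only the values come from the paper's excerpt.
smbpo_config = {
    "discount_gamma": 0.99,      # γ = 0.99 in all experiments
    "rollout_horizon": 10,       # H = 10 reported to work slightly better than H = 5
    "entropy_alpha": "auto",     # SAC temperature tuned per Haarnoja et al. [2018b]
    "target_update_tau": 0.005,  # coefficient in (0, 1), "often chosen small"
}

if __name__ == "__main__":
    for name, value in smbpo_config.items():
        print(f"{name}: {value}")
```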