Safe Reinforcement Learning by Imagining the Near Future
Authors: Garrett Thomas, Yuping Luo, Tengyu Ma
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks. In the experimental evaluation, we compare our algorithm to several model-free safe RL algorithms, as well as MBPO, on various continuous control tasks based on the MuJoCo simulator [Todorov et al., 2012]. |
| Researcher Affiliation | Academia | Garrett Thomas Stanford University gwthomas@stanford.edu Yuping Luo Princeton University yupingl@cs.princeton.edu Tengyu Ma Stanford University tengyuma@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Safe Model-Based Policy Optimization (SMBPO) (a hedged sketch of its model-rollout step appears below the table) |
| Open Source Code | Yes | Code is made available at https://github.com/gwthomas/Safe-MBPO. |
| Open Datasets | Yes | The tasks are described below: Hopper: Standard hopper environment from OpenAI Gym... Cheetah-no-flip: The standard half-cheetah environment from OpenAI Gym... Ant, Humanoid: Standard ant and humanoid environments from OpenAI Gym... [Todorov et al., 2012] (MuJoCo simulator) |
| Dataset Splits | No | The paper describes continuous control tasks in reinforcement learning environments (OpenAI Gym, MuJoCo) where data is generated through interaction with the environment rather than drawn from predefined train/validation/test splits; no validation splits are mentioned. |
| Hardware Specification | No | The paper does not specify the particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | While the paper mentions the MuJoCo simulator and OpenAI Gym environments and cites PyTorch, it does not provide version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Additional experimental details, including hyperparameter selection, are given in Appendix A.3. Our algorithm requires very little hyperparameter tuning. We use γ = 0.99 in all experiments. We tried both H = 5 and H = 10 and found that H = 10 works slightly better, so we use H = 10 in all experiments. The scalar α is a hyperparameter of SAC which controls the tradeoff between entropy and reward. We tune α using the procedure suggested by Haarnoja et al. [2018b]. ... for a hyperparameter τ ∈ (0, 1) which is often chosen small, e.g., 0.005. |
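To make the Pseudocode row more concrete, the snippet below is a minimal sketch of the "imagining the near future" step around which Algorithm 1 (SMBPO) is built: short rollouts of horizon H under a learned dynamics model, with unsafe imagined states penalized and treated as terminal so that the SAC policy learns to steer away from them. It is illustrative only: `agent`, `model`, and `is_unsafe` are placeholder interfaces rather than identifiers from the released Safe-MBPO code, and the penalty magnitude C required by the paper's analysis is not reproduced here.

```python
import numpy as np

def smbpo_rollouts(agent, model, is_unsafe, start_states, horizon=10, penalty=100.0):
    """Hedged sketch of SMBPO-style short model rollouts (placeholder APIs, not the authors' code).

    Rolls the learned dynamics model out up to `horizon` steps from real states,
    penalizing and terminating any transition that lands in an unsafe state.
    `penalty` stands in for the paper's reward penalty C, whose required magnitude
    is derived in the paper and not reproduced here.
    """
    imagined = []                         # collected (s, a, r, s_next, done) batches
    s = np.asarray(start_states)
    for _ in range(horizon):
        a = agent.act(s)                  # placeholder policy interface
        s_next, r = model.predict(s, a)   # placeholder learned-dynamics interface
        unsafe = is_unsafe(s_next)        # task-specific safety check (e.g., the hopper falling over)
        r = np.where(unsafe, -penalty, r) # penalize unsafe transitions...
        imagined.append((s, a, r, s_next, unsafe))
        s = s_next[~unsafe]               # ...and continue only from safe imagined states
        if len(s) == 0:
            break
    return imagined
```

In the full algorithm these imagined transitions would be added to a model buffer and mixed with real transitions for the SAC updates; those details are given in the paper's Algorithm 1 and the released code.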
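The Experiment Setup row quotes γ = 0.99, rollout horizon H = 10, automatic tuning of the SAC temperature α following Haarnoja et al. [2018b], and a Polyak coefficient τ around 0.005. The PyTorch snippet below is a hedged sketch of how the last two pieces are commonly implemented; the learning rate and the function names are assumptions, not values or identifiers taken from the paper or its repository.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row (Appendix A.3 of the paper).
GAMMA = 0.99    # discount factor, used in all experiments
HORIZON = 10    # rollout horizon H (H = 10 reported to work slightly better than H = 5)
TAU = 0.005     # Polyak coefficient tau in (0, 1), "often chosen small, e.g., 0.005"

def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = TAU) -> None:
    """Polyak update of the target critic: theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

# Automatic tuning of the SAC temperature alpha, in the spirit of Haarnoja et al. [2018b]:
# take gradient steps on alpha so the policy entropy tracks a target entropy.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)  # learning rate is an assumption

def update_alpha(log_pi: torch.Tensor, target_entropy: float) -> float:
    """One temperature update; target_entropy is commonly set to -(action dimension)."""
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```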