Safe Reinforcement Learning by Imagining the Near Future

Authors: Garrett Thomas, Yuping Luo, Tengyu Ma

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks. In the experimental evaluation, we compare our algorithm to several model-free safe RL algorithms, as well as MBPO, on various continuous control tasks based on the MuJoCo simulator [Todorov et al., 2012].
Researcher Affiliation | Academia | Garrett Thomas (Stanford University, gwthomas@stanford.edu); Yuping Luo (Princeton University, yupingl@cs.princeton.edu); Tengyu Ma (Stanford University, tengyuma@stanford.edu)
Pseudocode | Yes | Algorithm 1: Safe Model-Based Policy Optimization (SMBPO). (A hypothetical sketch of the short-horizon rollout idea appears after this table.)
Open Source Code | Yes | Code is made available at https://github.com/gwthomas/Safe-MBPO.
Open Datasets | Yes | The tasks are described below: Hopper: standard hopper environment from OpenAI Gym... Cheetah-no-flip: the standard half-cheetah environment from OpenAI Gym... Ant, Humanoid: standard ant and humanoid environments from OpenAI Gym... [Todorov et al., 2012] (MuJoCo simulator). (See the environment sketch after this table.)
Dataset Splits | No | The paper describes continuous control tasks within reinforcement learning environments (OpenAI Gym, MuJoCo) where data is generated dynamically through interaction, rather than using predefined training/validation/test splits; no validation splits are mentioned.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | While the paper mentions frameworks such as MuJoCo and OpenAI Gym, and cites PyTorch, it does not provide version numbers for these or other software dependencies.
Experiment Setup | Yes | Additional experimental details, including hyperparameter selection, are given in Appendix A.3. Our algorithm requires very little hyperparameter tuning. We use γ = 0.99 in all experiments. We tried both H = 5 and H = 10 and found that H = 10 works slightly better, so we use H = 10 in all experiments. The scalar α is a hyperparameter of SAC which controls the tradeoff between entropy and reward. We tune α using the procedure suggested by Haarnoja et al. [2018b]. ... for a hyperparameter τ ∈ (0, 1) which is often chosen small, e.g., 0.005. (These values are collected in the configuration sketch at the end of this section.)
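The Pseudocode row above names Algorithm 1 (SMBPO) but does not reproduce it. Below is a minimal, hypothetical Python sketch of the general idea suggested by the paper's title and the quoted details: short H-step "imagined" rollouts under a learned model, with unsafe states treated as terminal and heavily penalized. The dynamics model, policy, safety predicate, and penalty value are all stand-ins, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch only: every component below is a stand-in, not the paper's code.
H = 10              # rollout horizon, as reported in the table above
PENALTY_C = 100.0   # illustrative large penalty for reaching an unsafe state

def fake_dynamics(state, action):
    """Stand-in for a learned dynamics model: returns (next_state, reward)."""
    next_state = state + 0.1 * action
    reward = -float(np.linalg.norm(next_state))
    return next_state, reward

def fake_policy(state):
    """Stand-in for the current policy (e.g. a SAC actor)."""
    return -0.5 * state

def is_unsafe(state):
    """Stand-in safety predicate (e.g. the hopper has fallen over)."""
    return bool(np.abs(state).max() > 10.0)

def imagined_rollout(start_state):
    """Roll the model forward up to H steps, terminating with a penalty if unsafe."""
    transitions = []
    state = start_state
    for _ in range(H):
        action = fake_policy(state)
        next_state, reward = fake_dynamics(state, action)
        done = is_unsafe(next_state)
        if done:
            reward = -PENALTY_C  # large negative reward discourages unsafe states
        transitions.append((state, action, reward, next_state, done))
        if done:
            break
        state = next_state
    return transitions

if __name__ == "__main__":
    rollout = imagined_rollout(np.array([1.0, -1.0]))
    print(f"collected {len(rollout)} imagined transitions")
```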
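The environments in the Open Datasets row are standard OpenAI Gym / MuJoCo tasks (plus a modified cheetah-no-flip variant). The sketch below only illustrates how such environments are typically instantiated; the exact environment IDs, versions, and Gym API used by the paper are assumptions here, and an older Gym interface (reset returns the observation, step returns four values) is assumed.

```python
import gym  # requires a MuJoCo-enabled Gym installation

# Hypothetical environment IDs; the paper modifies the standard tasks
# (e.g. the custom cheetah-no-flip safety condition), so these names are illustrative.
ENV_IDS = ["Hopper-v2", "HalfCheetah-v2", "Ant-v2", "Humanoid-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```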
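For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. Only the values (γ = 0.99, H = 10, SAC-style α tuning, a small target-update coefficient of 0.005) come from the excerpt; the dictionary keys are illustrative and not taken from the released code.

```python
# Hypothetical configuration sketch assembling the hyperparameters quoted above.
# Key names are illustrative; only the values come from the paper's excerpt.
smbpo_config = {
    "discount_gamma": 0.99,      # γ = 0.99 in all experiments
    "rollout_horizon": 10,       # H = 10 reported to work slightly better than H = 5
    "entropy_alpha": "auto",     # SAC temperature tuned per Haarnoja et al. [2018b]
    "target_update_tau": 0.005,  # coefficient in (0, 1), "often chosen small"
}

if __name__ == "__main__":
    for name, value in smbpo_config.items():
        print(f"{name}: {value}")
```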