Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning
Authors: Yecheng Jason Ma, Andrew Shen, Osbert Bastani, Dinesh Jayaraman
AAAI 2022, pp. 5404-5412 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms. |
| Researcher Affiliation | Academia | University of Pennsylvania; University of Melbourne |
| Pseudocode | Yes | Algorithm 1: Safe MBRL with Conservative and Adaptive Penalty (CAP). (A hedged sketch of the adaptive penalty update follows the table.) |
| Open Source Code | Yes | Our code is included in the supplementary materials. |
| Open Datasets | Yes | A velocity-constrained version of MuJoCo Half Cheetah (Todorov, Erez, and Tassa 2012), representative of robot tasks in which we want to avoid robots damaging themselves from over-exertion. |
| Dataset Splits | No | The paper describes training in terms of environment steps and episodes but does not specify explicit training/validation/test splits (e.g., percentages or sample counts). This is typical for reinforcement learning, where policies are trained interactively and evaluated in the environment itself. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models, CPU models, or detailed computer specifications. |
| Software Dependencies | No | The paper mentions specific software components and methods such as the "Gurobi Optimizer", the "constrained cross entropy method (CCEM)", and "PlaNet", but it does not provide version numbers for these or other software dependencies used in its experiments. |
| Experiment Setup | Yes | For each method (except the oracle), the training procedure lasts 30 iterations, in which each iterate includes (1) collecting 500 samples using the current LP solution, (2) updating the transition estimate T̂, and (3) solving the new conservative LP objective. (A hedged loop skeleton follows the table.) |
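
The pseudocode row above refers to Algorithm 1, in which CAP adapts a cost-penalty coefficient κ from environment feedback. Below is a minimal sketch assuming a simple proportional update of κ against a cost budget; the function name `update_penalty` and the parameters `lr` and `kappa_max` are hypothetical, and the paper's exact update rule may differ.

```python
import numpy as np

def update_penalty(kappa, episode_cost, cost_budget, lr=0.1, kappa_max=100.0):
    # Grow kappa when the observed constraint cost exceeds the budget,
    # shrink it otherwise; clip to keep the penalty non-negative and bounded.
    kappa = kappa + lr * (episode_cost - cost_budget)
    return float(np.clip(kappa, 0.0, kappa_max))

# Example: kappa relaxes as observed costs fall toward a budget of 8.0.
kappa = 1.0
for episode_cost in [12.0, 9.0, 7.5, 6.0]:
    kappa = update_penalty(kappa, episode_cost, cost_budget=8.0)
```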
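
The experiment-setup row describes a three-step tabular loop. The skeleton below sketches that loop under stated assumptions: a count-based transition estimate T̂ with Laplace smoothing, and hypothetical `env_step(s, a) -> s_next` and `solve_conservative_lp(t_hat) -> policy` callables standing in for the environment and the Gurobi-based LP solver (with `policy[s]` a distribution over actions). It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def run_cap_tabular(env_step, solve_conservative_lp, n_states, n_actions,
                    n_iters=30, samples_per_iter=500, s0=0):
    # Laplace-smoothed visit counts define the empirical transition model T_hat.
    counts = np.ones((n_states, n_actions, n_states))
    t_hat = counts / counts.sum(axis=-1, keepdims=True)
    policy = solve_conservative_lp(t_hat)
    s = s0
    for _ in range(n_iters):
        # (1) collect samples using the current LP solution
        for _ in range(samples_per_iter):
            a = int(np.random.choice(n_actions, p=policy[s]))
            s_next = env_step(s, a)
            counts[s, a, s_next] += 1
            s = s_next
        # (2) update the transition estimate T_hat
        t_hat = counts / counts.sum(axis=-1, keepdims=True)
        # (3) solve the new conservative LP objective
        policy = solve_conservative_lp(t_hat)
    return policy
```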