Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning

Authors: Yecheng Jason Ma, Andrew Shen, Osbert Bastani, Dinesh Jayaraman

AAAI 2022, pp. 5404-5412 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms.
Researcher Affiliation | Academia | 1 University of Pennsylvania; 2 University of Melbourne
Pseudocode | Yes | Algorithm 1: Safe MBRL with Conservative and Adaptive Penalty (CAP). (A minimal sketch of the adaptive penalty update appears after this table.)
Open Source Code | Yes | Our code is included in the supplementary materials.
Open Datasets | Yes | A velocity-constrained version of MuJoCo HalfCheetah (Todorov, Erez, and Tassa 2012), representative of robot tasks in which we want to prevent robots from damaging themselves through over-exertion. (See the cost-wrapper sketch after this table.)
Dataset Splits | No | The paper describes training in terms of environment steps and episodes but does not specify explicit training/validation/test splits (e.g., percentages or sample counts). This is typical for reinforcement learning, where policies are trained interactively and evaluated in the environment itself.
Hardware Specification | No | The paper does not describe the hardware used for its experiments, such as GPU or CPU models or other machine specifications.
Software Dependencies | No | The paper mentions specific software components and methods such as the Gurobi Optimizer, the constrained cross-entropy method (CCEM), and PlaNet, but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | For each method (except the oracle), the training procedure lasts 30 iterations, where each iteration includes (1) collecting 500 samples using the current LP solution, (2) updating the estimated transition model T̂, and (3) solving the new conservative LP objective. (See the training-loop skeleton below.)