Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning
Authors: Yecheng Jason Ma, Andrew Shen, Osbert Bastani, Dinesh Jayaraman
AAAI 2022, pp. 5404-5412 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms. |
| Researcher Affiliation | Academia | University of Pennsylvania; University of Melbourne |
| Pseudocode | Yes | Algorithm 1: Safe MBRL with Conservative and Adaptive Penalty (CAP). (A hedged sketch of the adaptive penalty update follows the table.) |
| Open Source Code | Yes | Our code is included in the supplementary materials. |
| Open Datasets | Yes | A velocity-constrained version of MuJoCo Half Cheetah (Todorov, Erez, and Tassa 2012), representative of robot tasks in which we want to avoid robots damaging themselves from over-exertion. |
| Dataset Splits | No | The paper describes training in terms of environment steps and episodes but does not specify explicit training/validation/test splits (e.g., percentages or sample counts). This is typical for reinforcement learning, where policies are trained interactively and evaluated in the environment itself. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models, CPU models, or detailed computer specifications. |
| Software Dependencies | No | The paper mentions specific software components and methods such as the "Gurobi Optimizer", the "constrained cross entropy method (CCEM)", and "PlaNet", but it does not provide version numbers for these or other software dependencies used in its experiments. |
| Experiment Setup | Yes | For each method (except the oracle), the training procedure lasts 30 iterations, in which each iterate includes (1) collecting 500 samples using the current LP solution, (2) updating the transition estimate T̂, and (3) solving the new conservative LP objective. (A hedged loop skeleton follows the table.) |
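
The pseudocode row above refers to Algorithm 1, in which CAP adapts a cost-penalty coefficient κ from environment feedback. Below is a minimal sketch assuming a simple proportional update of κ against a cost budget; the function name `update_penalty` and the parameters `lr` and `kappa_max` are hypothetical, and the paper's exact update rule may differ.

```python
import numpy as np

def update_penalty(kappa, episode_cost, cost_budget, lr=0.1, kappa_max=100.0):
    # Grow kappa when the observed constraint cost exceeds the budget,
    # shrink it otherwise; clip to keep the penalty non-negative and bounded.
    kappa = kappa + lr * (episode_cost - cost_budget)
    return float(np.clip(kappa, 0.0, kappa_max))

# Example: kappa relaxes as observed costs fall toward a budget of 8.0.
kappa = 1.0
for episode_cost in [12.0, 9.0, 7.5, 6.0]:
    kappa = update_penalty(kappa, episode_cost, cost_budget=8.0)
```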
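
The experiment-setup row describes a three-step tabular loop. The skeleton below sketches that loop under stated assumptions: a count-based transition estimate T̂ with Laplace smoothing, and hypothetical `env_step(s, a) -> s_next` and `solve_conservative_lp(t_hat) -> policy` callables standing in for the environment and the Gurobi-based LP solver (with `policy[s]` a distribution over actions). It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def run_cap_tabular(env_step, solve_conservative_lp, n_states, n_actions,
                    n_iters=30, samples_per_iter=500, s0=0):
    # Laplace-smoothed visit counts define the empirical transition model T_hat.
    counts = np.ones((n_states, n_actions, n_states))
    t_hat = counts / counts.sum(axis=-1, keepdims=True)
    policy = solve_conservative_lp(t_hat)
    s = s0
    for _ in range(n_iters):
        # (1) collect samples using the current LP solution
        for _ in range(samples_per_iter):
            a = int(np.random.choice(n_actions, p=policy[s]))
            s_next = env_step(s, a)
            counts[s, a, s_next] += 1
            s = s_next
        # (2) update the transition estimate T_hat
        t_hat = counts / counts.sum(axis=-1, keepdims=True)
        # (3) solve the new conservative LP objective
        policy = solve_conservative_lp(t_hat)
    return policy
```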