Towards Safe Reinforcement Learning with a Safety Editor Policy
Authors: Haonan Yu, Wei Xu, Haichao Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016), targeting very low violation rates. SEditor obtains a much higher overall safety-weighted-utility (SWU) score (defined in Section 4) than four baselines. It demonstrates outstanding utility performance with constraint violation rates as low as once per 2k time steps, even in obstacle-dense environments. Our results reveal that the two-policy cooperation is critical, while simply doubling the size of a single policy network does not lead to comparable results. The choices of the action distance function and editing function are also important in certain circumstances. |
| Researcher Affiliation | Industry | Haonan Yu, Wei Xu, and Haichao Zhang; Horizon Robotics, Cupertino, CA 95014; {haonan.yu,wei.xu,haichao.zhang}@horizon.ai |
| Pseudocode | No | The paper contains no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Code is available at https://github.com/hnyu/seditor. |
| Open Datasets | Yes | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016)... Our customized Safety Gym is available at https://github.com/hnyu/safety-gym. |
| Dataset Splits | No | The paper describes training and evaluation within a simulation environment but does not specify explicit training/validation/test dataset splits with percentages or sample counts, as data is generated through interaction. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA DGX servers (V100-32GB). |
| Software Dependencies | Yes | We use Python 3.9 and PyTorch 1.10 for all implementations. |
| Experiment Setup | Yes | "All compared approaches, including the variants of SEditor, share a common training configuration (e.g., replay buffer size, mini-batch size, learning rate, etc.) as much as possible." Further details appear in "Appendix H Training Details", which specifies values for the learning rate, replay buffer size, mini-batch size, discount factor, entropy coefficient, etc. |