Towards Safe Reinforcement Learning with a Safety Editor Policy

Authors: Haonan Yu, Wei Xu, Haichao Zhang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016), targeting very low violation rates. SEditor obtains a much higher overall safety-weighted-utility (SWU) score (defined in Section 4) than four baselines. It demonstrates outstanding utility performance with constraint violation rates as low as once per 2k time steps, even in obstacle-dense environments. Our results reveal that the two-policy cooperation is critical, while simply doubling the size of a single policy network will not lead to comparable results. The choices of the action distance function and editing function are also important in certain circumstances.
Researcher Affiliation | Industry | Haonan Yu, Wei Xu, and Haichao Zhang, Horizon Robotics, Cupertino, CA 95014, {haonan.yu,wei.xu,haichao.zhang}@horizon.ai
Pseudocode | No | The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor does it present any structured steps formatted like code.
Open Source Code | Yes | Code is available at https://github.com/hnyu/seditor.
Open Datasets | Yes | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016)... Our customized Safety Gym is available at https://github.com/hnyu/safety-gym.
Dataset Splits | No | The paper describes training and evaluation within a simulation environment but does not specify explicit training/validation/test dataset splits with percentages or sample counts, as data is generated through agent-environment interaction.
Hardware Specification | Yes | All experiments are conducted on NVIDIA DGX servers (V100-32GB).
Software Dependencies | Yes | We use Python 3.9 and PyTorch 1.10 for all implementations.
Experiment Setup | Yes | "All compared approaches, including the variants of SEditor, share a common training configuration (e.g., replay buffer size, mini-batch size, learning rate, etc.) as much as possible." Further details are given in Appendix H ("Training Details"), which specifies values for the learning rate, replay buffer size, mini-batch size, discount factor, entropy coefficient, etc.
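The shared-configuration idea in the Experiment Setup row can be sketched as a small Python helper. Every value below is an illustrative placeholder, not a number reported in the paper's Appendix H; only the set of hyperparameter names mirrors those listed above, and `make_trainer_config` is a hypothetical helper introduced here for illustration.

```python
# Hypothetical sketch of a shared off-policy training configuration.
# All values are placeholders, NOT the ones from the paper's Appendix H;
# only the hyperparameter names follow the text above.
common_config = {
    "learning_rate": 3e-4,            # optimizer step size (placeholder)
    "replay_buffer_size": 1_000_000,  # transitions stored (placeholder)
    "mini_batch_size": 256,           # samples per gradient update (placeholder)
    "discount_factor": 0.99,          # gamma for return bootstrapping (placeholder)
    "entropy_coefficient": 0.2,       # exploration bonus weight (placeholder)
}

def make_trainer_config(overrides=None):
    """Return a copy of the common config with optional per-method overrides,
    mirroring the idea that all compared approaches share one base setup."""
    cfg = dict(common_config)
    cfg.update(overrides or {})
    return cfg
```

For example, `make_trainer_config({"entropy_coefficient": 0.1})` changes one setting for a single method while keeping every other hyperparameter identical across the compared approaches, which is the kind of controlled comparison the row describes.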