Towards Safe Reinforcement Learning with a Safety Editor Policy

Authors: Haonan Yu, Wei Xu, Haichao Zhang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016), targeting very low violation rates. SEditor obtains a much higher overall safety-weighted-utility (SWU) score (defined in Section 4) than four baselines. It demonstrates outstanding utility performance with constraint violation rates as low as once per 2k time steps, even in obstacle-dense environments. Our results reveal that the two-policy cooperation is critical, while simply doubling the size of a single policy network will not lead to comparable results. The choices of the action distance function and editing function are also important in certain circumstances.
Researcher Affiliation | Industry | Haonan Yu, Wei Xu, and Haichao Zhang, Horizon Robotics, Cupertino, CA 95014, {haonan.yu,wei.xu,haichao.zhang}@horizon.ai
Pseudocode | No | The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor does it present any structured steps formatted like code.
Open Source Code | Yes | Code is available at https://github.com/hnyu/seditor.
Open Datasets | Yes | We evaluate SEditor on 12 Safety Gym (Ray et al., 2019) tasks and 2 safe car racing tasks adapted from Brockman et al. (2016)... Our customized Safety Gym is available at https://github.com/hnyu/safety-gym.
Dataset Splits | No | The paper describes training and evaluation within a simulation environment but does not specify explicit training/validation/test dataset splits with percentages or sample counts, as data is generated through agent-environment interaction.
Hardware Specification | Yes | All experiments are conducted on NVIDIA DGX servers (V100-32GB).
Software Dependencies | Yes | We use Python 3.9 and PyTorch 1.10 for all implementations.
Experiment Setup | Yes | "All compared approaches, including the variants of SEditor, share a common training configuration (e.g., replay buffer size, mini-batch size, learning rate, etc.) as much as possible." Further details are given in Appendix H ("Training Details"), which specifies values for the learning rate, replay buffer size, mini-batch size, discount factor, entropy coefficient, etc.
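The shared-configuration idea in the Experiment Setup row can be sketched as a small Python helper. Every value below is an illustrative placeholder, not a number reported in the paper's Appendix H; only the set of hyperparameter names mirrors those listed above, and `make_trainer_config` is a hypothetical helper introduced here for illustration.

```python
# Hypothetical sketch of a shared off-policy training configuration.
# All values are placeholders, NOT the ones from the paper's Appendix H;
# only the hyperparameter names follow the text above.
common_config = {
    "learning_rate": 3e-4,            # optimizer step size (placeholder)
    "replay_buffer_size": 1_000_000,  # transitions stored (placeholder)
    "mini_batch_size": 256,           # samples per gradient update (placeholder)
    "discount_factor": 0.99,          # gamma for return bootstrapping (placeholder)
    "entropy_coefficient": 0.2,       # exploration bonus weight (placeholder)
}

def make_trainer_config(overrides=None):
    """Return a copy of the common config with optional per-method overrides,
    mirroring the idea that all compared approaches share one base setup."""
    cfg = dict(common_config)
    cfg.update(overrides or {})
    return cfg
```

For example, `make_trainer_config({"entropy_coefficient": 0.1})` changes one setting for a single method while keeping every other hyperparameter identical across the compared approaches, which is the kind of controlled comparison the row describes.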