Enhancing Safe Exploration Using Safety State Augmentation
Authors: Aivar Sootla, Alexander Cowen-Rivers, Jun Wang, Haitham Bou Ammar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this idea to two safe RL problems: RL with constraints imposed on an average cost, and RL with constraints imposed on a cost with probability one. Our experiments suggest that simmering a safe algorithm can improve safety during training for both settings. We further show that Simmer can stabilize training and improve the performance of safe RL with average constraints. |
| Researcher Affiliation | Collaboration | Aivar Sootla, Byju's Lab, aivar.sootla@gmail.com; Alexander I. Cowen-Rivers, Technische Universität Darmstadt, mc rivers@icloud.com; Jun Wang, University College London, jun.wang@cs.ucl.ac.uk; Haitham Bou Ammar, Huawei R&D, haitham.ammar@huawei.com |
| Pseudocode | Yes | Algorithm 1: PI SIMMER (basic version); Algorithm 2: Q-SIMMER |
| Open Source Code | Yes | The code for PI Simmer and Q Simmer is available at https://github.com/huawei-noah/HEBO/tree/master/SIMMER. |
| Open Datasets | Yes | Environments: We use the safe pendulum environment defined in [16], and we also use the custom-made safety gym environment with deterministic constraints, which we call static point goal [51]. ... The rest of our tests are performed on the safety gym benchmarks [37]. |
| Dataset Splits | No | The paper does not explicitly provide information about train/validation/test splits, only mentioning datasets used for testing/evaluation. For example, it states "Mean returns and cost are computed over a hundred different trajectories obtained for three different seeds.", but not data splits. |
| Hardware Specification | Yes | Computational resources: We performed all computations on a PC equipped with 512GB of RAM, two Intel Xeon E5 CPUs, and four 16GB NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions that the code is based on "safety starter agents [37], and PID Lagrangian [44]" and that it uses "default parameters for both code bases unless stated otherwise." It also mentions "Python" indirectly in the ethics statement. However, it does not provide specific version numbers for Python, any libraries, or specific software dependencies needed for reproducibility. |
| Experiment Setup | Yes | For PI Simmer we chose the following hyper-parameters K = 0.01, Ki = 0.005, Kaw = 0.01 and τ = 0.995. ... For Q Simmer we chose δ = 1, τ = 0.995, lr = 0.05, and ε = 0.95. ... We have used the same hyper-parameters for all algorithms, which are default parameters in safety starter agents and the learning rate 0.03. ... In all our experiments we used the same hyper-parameters for all versions of PID-L, i.e., K = 0.1, Ki = 0.01, γl = 0.99. |
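
The hyper-parameters quoted in the Experiment Setup row can be gathered into one place for a reproduction attempt. The following is a minimal sketch, not the authors' configuration format: the class and field names (`PISimmerConfig`, `QSimmerConfig`, `PIDLagrangianConfig`, `BASELINE_LEARNING_RATE`) are hypothetical, and only the numeric values come from the row above. The row does not say which learning rate the value 0.03 refers to, so it is kept as a standalone constant.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PISimmerConfig:
    """Values quoted for PI Simmer; field names are illustrative."""
    K: float = 0.01      # quoted as K
    Ki: float = 0.005    # quoted as Ki
    Kaw: float = 0.01    # quoted as Kaw
    tau: float = 0.995   # quoted as tau


@dataclass(frozen=True)
class QSimmerConfig:
    """Values quoted for Q Simmer; field names are illustrative."""
    delta: float = 1.0     # quoted as delta
    tau: float = 0.995     # quoted as tau
    lr: float = 0.05       # quoted as lr
    epsilon: float = 0.95  # quoted as epsilon


@dataclass(frozen=True)
class PIDLagrangianConfig:
    """Values quoted for all PID-L variants; field names are illustrative."""
    K: float = 0.1
    Ki: float = 0.01
    gamma_l: float = 0.99


# Quoted alongside the safety starter agents defaults; the row does not
# specify which learning rate this is, so no further meaning is assumed.
BASELINE_LEARNING_RATE = 0.03


def reported_hyperparameters() -> dict:
    """Return every quoted value as a plain nested dict, e.g. for logging."""
    return {
        "pi_simmer": asdict(PISimmerConfig()),
        "q_simmer": asdict(QSimmerConfig()),
        "pid_lagrangian": asdict(PIDLagrangianConfig()),
        "baseline_learning_rate": BASELINE_LEARNING_RATE,
    }


if __name__ == "__main__":
    for name, values in reported_hyperparameters().items():
        print(name, values)
```

Frozen dataclasses keep the quoted values from being changed by accident during a run. Any setting not listed here falls back to the defaults of the two code bases the paper builds on, which this sketch does not reproduce.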