Adversarial Behavior Exclusion for Safe Reinforcement Learning
Authors: Md Asifur Rahman, Tongtong Liu, Sarra Alqahtani
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the robustness of AdvEx-RL via comprehensive experiments in standard constrained Markov decision process (CMDP) environments under 2 white-box action-space perturbations as well as with changes in environment dynamics against 7 baselines. |
| Researcher Affiliation | Academia | Md Asifur Rahman, Tongtong Liu, and Sarra Alqahtani, Department of Computer Science, Wake Forest University {rahmm21, liut18, sarra-alqahtani}@wfu.edu |
| Pseudocode | Yes | Details on the training of the adversarial policy are given in Appendix A, Algorithm 1. More details on AdvEx-RL safety policy training can be found in Appendix B, Algorithm 2. Algorithm 3 in Appendix D shows the online execution of AdvEx-RL. |
| Open Source Code | Yes | All the code relevant to the experiments is available online at https://github.com/asifurrahman1/AdvEx-RL |
| Open Datasets | Yes | We conducted our experiments on three continuous MuJoCo CMDPs [Thananjeyan et al., 2021]: (i) Maze, (ii) Navigation 1, and (iii) Navigation 2. In addition, we also conducted experiments on Safety Gym environments [Ray et al., 2019]. |
| Dataset Splits | No | The paper mentions conducting experiments in 'training environments' and 'testing environments' (e.g., '10 times more variations in the testing environment dynamics' and 'averaged over 100 test episodes'), but it does not provide specific numerical dataset split information (percentages or sample counts) for training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using SAC (Soft Actor-Critic) for training policies, and conducting experiments in MuJoCo CMDPs and Safety Gym environments, but it does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Tsafety is a predefined threshold value such that at any state st and for any action at ∼ πtask(st), if Shield(st, at) is triggered, then the AdvEx-RL safety firewall replaces the selected action at by a safer action asafe_t ∼ πsafety(st) given by the safety policy. The value of Tsafety is environment-specific and can be chosen based on a sensitivity test for each environment (see Appendix C for details about the sensitivity test). Algorithm 3 in Appendix D shows the online execution of AdvEx-RL. In addition, see Appendix G for further implementation details of AdvEx-RL and the baselines. |
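The Experiment Setup row describes AdvEx-RL's online safety firewall: the task policy proposes an action, and if the shield flags it against the threshold Tsafety, the safety policy's action is substituted. A minimal sketch of that step is below; all names (`shield_score`, `task_policy`, `safety_policy`, `T_SAFETY`) are illustrative placeholders, not the authors' actual code, and the threshold value is arbitrary since the paper says it is environment-specific.

```python
# Hypothetical sketch of the AdvEx-RL online shielding step (Algorithm 3 in
# the paper's Appendix D). Function and variable names are assumptions.

T_SAFETY = 0.5  # environment-specific threshold, chosen via a sensitivity test


def shielded_action(state, task_policy, safety_policy, shield_score):
    """Return the task action unless the shield deems it unsafe."""
    action = task_policy(state)                   # a_t ~ pi_task(s_t)
    if shield_score(state, action) > T_SAFETY:    # Shield(s_t, a_t) triggered
        action = safety_policy(state)             # a_safe_t ~ pi_safety(s_t)
    return action
```

At execution time this keeps the task policy in control on benign states and only overrides it when the learned shield estimates the proposed action exceeds the safety threshold.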