Adversarial Behavior Exclusion for Safe Reinforcement Learning
Authors: Md Asifur Rahman, Tongtong Liu, Sarra Alqahtani
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the robustness of AdvEx-RL via comprehensive experiments in standard constrained Markov decision process (CMDP) environments under 2 white-box action-space perturbations as well as with changes in environment dynamics against 7 baselines. |
| Researcher Affiliation | Academia | Md Asifur Rahman, Tongtong Liu, and Sarra Alqahtani, Department of Computer Science, Wake Forest University {rahmm21, liut18, sarra-alqahtani}@wfu.edu |
| Pseudocode | Yes | Details on the training of the adversarial policy are given in Appendix A, Algorithm 1. More details on AdvEx-RL safety policy training can be found in Appendix B, Algorithm 2. Algorithm 3 in Appendix D shows the online execution of AdvEx-RL. |
| Open Source Code | Yes | All the code relevant to the experiments is available online at https://github.com/asifurrahman1/AdvEx-RL |
| Open Datasets | Yes | We conducted our experiments on three continuous MuJoCo CMDPs [Thananjeyan et al., 2021]: (i) Maze, (ii) Navigation 1, and (iii) Navigation 2. In addition, we also conducted experiments on Safety Gym environments [Ray et al., 2019]. |
| Dataset Splits | No | The paper mentions conducting experiments in 'training environments' and 'testing environments' (e.g., '10 times more variations in the testing environment dynamics' and 'averaged over 100 test episodes'), but it does not provide specific numerical dataset split information (percentages or sample counts) for training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using SAC (Soft Actor-Critic) for training policies, and conducting experiments in MuJoCo CMDPs and Safety Gym environments, but it does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Tsafety is a predefined threshold value such that at any state st and for any action at ∼ πtask(st), if Shield(st, at) is triggered, then the AdvEx-RL safety firewall replaces the selected action at by a safer action asafe_t ∼ πsafety(st) given by the safety policy. The value of Tsafety is environment-specific and can be chosen based on a sensitivity test for each environment (see Appendix C for details about the sensitivity test). Algorithm 3 in Appendix D shows the online execution of AdvEx-RL. In addition, see Appendix G for further implementation details of AdvEx-RL and the baselines. |
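The Experiment Setup row describes AdvEx-RL's online safety firewall: the task policy proposes an action, and if the shield flags it against the threshold Tsafety, the safety policy's action is substituted. A minimal sketch of that step is below; all names (`shield_score`, `task_policy`, `safety_policy`, `T_SAFETY`) are illustrative placeholders, not the authors' actual code, and the threshold value is arbitrary since the paper says it is environment-specific.

```python
# Hypothetical sketch of the AdvEx-RL online shielding step (Algorithm 3 in
# the paper's Appendix D). Function and variable names are assumptions.

T_SAFETY = 0.5  # environment-specific threshold, chosen via a sensitivity test


def shielded_action(state, task_policy, safety_policy, shield_score):
    """Return the task action unless the shield deems it unsafe."""
    action = task_policy(state)                   # a_t ~ pi_task(s_t)
    if shield_score(state, action) > T_SAFETY:    # Shield(s_t, a_t) triggered
        action = safety_policy(state)             # a_safe_t ~ pi_safety(s_t)
    return action
```

At execution time this keeps the task policy in control on benign states and only overrides it when the learned shield estimates the proposed action exceeds the safety threshold.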