Safe Reinforcement Learning via Shielding
Authors: Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, Ufuk Topcu
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We applied shielded RL in four domains: (1) a robot in a grid world, (2) a self-driving car scenario, (3) the water tank scenario from Example 1, and (4) the pacman example. The results show that only the unshielded versions experience negative rewards. Furthermore, the shielded version is not only safe, but also tends to learn more rapidly. |
| Researcher Affiliation | Collaboration | (1) University of Texas at Austin, 210 East 24th Street, Austin, Texas 78712, USA; (2) Graz University of Technology, Rechbauerstraße 12, 8010 Graz, Austria; (3) University of Bremen and DFKI GmbH, Bibliothekstraße 1, 28359 Bremen, Germany |
| Pseudocode | No | The paper describes algorithmic steps in paragraph text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code, input files, and detailed instructions to reproduce our experiments are available for download: https://github.com/safe-rl/safe-rl-shielding |
| Open Datasets | No | The paper describes custom simulation environments (e.g., 'grid world', 'self-driving car scenario', 'water tank scenario', 'pacman example') but does not provide concrete access information or citations to publicly available datasets. |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset split percentages or sample counts for its experiments. |
| Hardware Specification | Yes | The simulations were performed on a computer equipped with an Intel Core i7-4790K and 16 GB of RAM running a 64-bit version of Ubuntu 16.04 LTS. |
| Software Dependencies | No | The paper mentions the operating system ('Ubuntu 16.04 LTS') but does not specify other key software components or libraries with version numbers required for replication. |
| Experiment Setup | Yes | In the grid-world domain, the agent uses tabular Q-learning with an ϵ-greedy explorer that is capable of multiple policy updates at once. In the self-driving car scenario, the agent uses a Deep Q-Network with a Boltzmann exploration policy; this network consists of 4 input nodes, 8 output nodes, and 3 hidden layers. In each step, a positive reward is given if the car moves a step in a clockwise direction along its track and a penalty is given if it moves in a counter-clockwise direction. A crash into the wall results in a penalty and a restart. The comparison is between no shielding (red, dashed), no shielding with large penalties for unsafe actions, and shielding with |rank_t| = 3 and penalties for corrected actions (black, solid). |
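
To make the shielding setup quoted above concrete, the following is a minimal sketch of post-posed shielding wrapped around tabular Q-learning with an ϵ-greedy explorer, as used in the grid-world domain. It assumes a Gym-style discrete environment and a shield exposed as `is_safe(state, action)` with a fallback `safe_action(state)`; these names, the hyperparameters, and the optional penalty for corrected actions are illustrative placeholders rather than the authors' implementation (which is available in the linked repository, where the shield is synthesized from a temporal-logic safety specification).

```python
import random
from collections import defaultdict

def shielded_q_learning(env, is_safe, safe_action, episodes=500,
                        alpha=0.1, gamma=0.99, epsilon=0.1,
                        correction_penalty=0.0):
    """Tabular Q-learning with a post-posed shield overriding unsafe actions."""
    Q = defaultdict(float)                      # Q[(state, action)]
    actions = list(range(env.action_space.n))   # discrete action set (assumed)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy proposal by the learner
            if random.random() < epsilon:
                proposed = random.choice(actions)
            else:
                proposed = max(actions, key=lambda a: Q[(state, a)])

            # the shield replaces an unsafe proposal with a safe action
            if is_safe(state, proposed):
                action, corrected = proposed, False
            else:
                action, corrected = safe_action(state), True

            next_state, reward, done, _ = env.step(action)

            # optional: punish the proposal that had to be corrected, so the
            # learner also learns to stop triggering the shield
            if corrected and correction_penalty:
                Q[(state, proposed)] += alpha * (-correction_penalty
                                                 - Q[(state, proposed)])

            # standard Q-learning update for the action actually executed
            target = reward
            if not done:
                target += gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```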
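
Similarly, the Deep Q-Network described for the self-driving car scenario (4 input nodes, 8 output nodes, 3 hidden layers, Boltzmann exploration) could look roughly like the sketch below. The hidden-layer width, the ReLU activations, and the PyTorch framework are assumptions; the paper does not report them.

```python
import torch
import torch.nn as nn

class CarQNetwork(nn.Module):
    """Q-network with 4 inputs, 8 outputs, and 3 hidden layers (widths assumed)."""
    def __init__(self, n_inputs=4, n_outputs=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_outputs))

    def forward(self, x):
        return self.net(x)

def boltzmann_action(q_values, temperature=1.0):
    # Boltzmann exploration: sample an action with probability
    # proportional to exp(Q / T)
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# usage sketch: q_net = CarQNetwork(); a = boltzmann_action(q_net(torch.randn(4)))
```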