Safe Reinforcement Learning via Shielding

Authors: Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, Ufuk Topcu

Venue: AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied shielded RL in four domains: (1) a robot in a grid world, (2) a self-driving car scenario, (3) the water tank scenario from Example 1, and (4) the pacman example. The results show that only the unshielded versions experience negative rewards. Furthermore, the shielded version is not only safe, but also tends to learn more rapidly.
Researcher Affiliation | Collaboration | (1) University of Texas at Austin, 210 East 24th Street, Austin, Texas 78712, USA; (2) Graz University of Technology, Rechbauerstraße 12, 8010 Graz, Austria; (3) University of Bremen and DFKI GmbH, Bibliothekstraße 1, 28359 Bremen, Germany
Pseudocode | No | The paper describes algorithmic steps in paragraph text but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code, input files, and detailed instructions to reproduce our experiments are available for download (https://github.com/safe-rl/safe-rl-shielding).
Open Datasets | No | The paper describes custom simulation environments (the grid world, self-driving car scenario, water tank scenario, and pacman example) but does not provide concrete access information or citations to publicly available datasets.
Dataset Splits | No | The paper does not provide specific training, validation, or test dataset split percentages or sample counts for its experiments.
Hardware Specification | Yes | The simulations were performed on a computer equipped with an Intel Core i7-4790K and 16 GB of RAM running a 64-bit version of Ubuntu 16.04 LTS.
Software Dependencies | No | The paper mentions the operating system (Ubuntu 16.04 LTS) but does not specify other key software components or libraries with version numbers required for replication.
Experiment Setup | Yes | The agent uses tabular Q-learning with an ϵ-greedy explorer that is capable of multiple policy updates at once. In another experiment, the agent uses a Deep Q-Network with a Boltzmann exploration policy; this network consists of 4 input nodes, 8 output nodes, and 3 hidden layers. In each step, a positive reward is given if the car moves a step in a clockwise direction along its track and a penalty is given if it moves in a counter-clockwise direction. A crash into the wall results in a penalty and a restart. We compare no shielding (red, dashed), no shielding with large penalties for unsafe actions (black, solid), and shielding with |rank_t| = 3 and penalties for corrected actions (black, solid).
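As context for the rows above: the paper's shield sits between the learning agent and the environment and overrides any proposed action that would violate the safety specification. The sketch below is a minimal illustration of that idea, not the authors' implementation; the safe-action table, the action set, the grid-world labels, and the environment interface are all invented for the example.

```python
import random
from collections import defaultdict

# Hypothetical shield: in the paper the allowed actions per state would be
# computed offline from a safety automaton and an abstraction of the
# environment; here they are a hand-written lookup table for illustration.
SAFE_ACTIONS = defaultdict(lambda: {0, 1, 2, 3})   # by default every action is allowed
SAFE_ACTIONS["near_wall"] = {0, 1, 2}              # e.g. forbid the action that drives into the wall

def shield(state, proposed_action):
    """Return the proposed action if the shield allows it, otherwise a safe substitute."""
    allowed = SAFE_ACTIONS[state]
    if proposed_action in allowed:
        return proposed_action
    return random.choice(sorted(allowed))          # any allowed action serves as the correction

# Tabular Q-learning with epsilon-greedy exploration, filtered through the shield.
Q = defaultdict(float)                             # Q[(state, action)]
ALPHA, GAMMA, EPSILON, ACTIONS = 0.1, 0.99, 0.1, [0, 1, 2, 3]

def learning_step(env, state):
    """One interaction step; `env` is a hypothetical object whose step() returns (next_state, reward, done)."""
    if random.random() < EPSILON:
        proposed = random.choice(ACTIONS)
    else:
        proposed = max(ACTIONS, key=lambda a: Q[(state, a)])
    action = shield(state, proposed)               # unsafe proposals never reach the environment
    next_state, reward, done = env.step(action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    return next_state, done
```

This corresponds to correcting the agent's action after it is proposed (the paper's post-posed shielding): learning proceeds as usual, but unsafe actions are replaced before execution, which is why only the unshielded runs accumulate negative rewards.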
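The Experiment Setup row quotes a Deep Q-Network with 4 input nodes, 8 output nodes, 3 hidden layers, and a Boltzmann exploration policy. Below is a minimal sketch of a network with that shape, sampling actions from a softmax over Q-values; the hidden-layer width, activation function, and temperature are not reported in the paper and are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

# Shape taken from the quoted setup: 4 inputs, 3 hidden layers, 8 outputs.
# The hidden width (64) and ReLU activations are assumptions for the sketch.
HIDDEN = 64
q_net = nn.Sequential(
    nn.Linear(4, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 8),
)

def boltzmann_action(state, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / temperature)."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example: pick an action for an arbitrary 4-dimensional observation.
print(boltzmann_action([0.0, 1.0, -0.5, 0.2]))
```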