Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Authors: Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel, Wolfgang Stammer, Kristian Kersting

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results provide evidence of SCoBots' competitive performances, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game, Pong, and resolve it.
Researcher Affiliation | Academia | Quentin Delfosse (1), Sebastian Sztwiertnia (1), Mark Rothermel (1), Wolfgang Stammer (1,2), Kristian Kersting (1,2,3,4); 1 Computer Science Department, TU Darmstadt, Germany; 2 Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany; 3 Centre for Cognitive Science, TU Darmstadt, Germany; 4 German Research Center for Artificial Intelligence (DFKI), Darmstadt, Germany; {firstname.lastname}@cs.tu-darmstadt.de
Pseudocode | No | The paper describes the architecture and processes verbally and with diagrams (e.g., Figure 2), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code available at https://github.com/k4ntz/SCoBots
Open Datasets | Yes | We evaluate SCoBots on 9 Atari games (cf. Fig. 3) from the Atari Learning Environments [Bellemare et al., 2012] (by far the most used RL framework, cf. App. A.1), as well as the HackAtari modified [Delfosse et al., 2024a] Pong environments.
Dataset Splits | Yes | Each training seed's performance is evaluated every 500k frames on 4 differently seeded (42 + training seed) environments for 8 episodes each. After training, the best performing checkpoint is then ultimately evaluated on 4 seeded (123, 456, 789, 1011) test environments. (A hedged sketch of this evaluation protocol follows the table.)
Hardware Specification | Yes | All experiments were run on an AMD Ryzen 7 processor, 64GB of RAM and one NVIDIA GeForce RTX 2080 Ti GPU.
Software Dependencies | No | All agents are trained for 20M frames under the Proximal Policy Optimization algorithm (PPO, [Schulman et al., 2017]), specifically the stable-baseline3 implementation [Raffin et al., 2021] and its default hyperparameters (cf. Tab. 2 in App. A.5). While the library (i.e., stable-baselines3) is named, a specific version number for the package itself is not provided.
Experiment Setup | Yes | All agents are trained for 20M frames under the Proximal Policy Optimization algorithm (PPO, [Schulman et al., 2017]), specifically the stable-baseline3 implementation [Raffin et al., 2021] and its default hyperparameters (cf. Tab. 2 in App. A.5).
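
The training setup quoted in the Experiment Setup row can be approximated in a few lines. This is a minimal sketch assuming the standard stable-baselines3 PPO API and a Gymnasium/ALE Atari environment; it omits the authors' object-centric SCoBots wrappers, and the environment id, policy choice, and checkpoint name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the reported training setup, assuming the standard
# stable-baselines3 PPO API and a Gymnasium/ALE Atari environment
# (may require: pip install "gymnasium[atari,accept-rom-license]" stable-baselines3).
# The authors' agents act on object-centric concept vectors via their
# SCoBots wrappers (not shown here); this only illustrates the PPO side.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("ALE/Pong-v5")               # one of the 9 evaluated Atari games (assumed id)
model = PPO("CnnPolicy", env, verbose=1)    # default hyperparameters, as reported
model.learn(total_timesteps=20_000_000)     # "trained for 20M frames"
model.save("ppo_pong_checkpoint")           # checkpoint name is illustrative
```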
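
Similarly, the seeded evaluation protocol quoted in the Dataset Splits row can be sketched as below. The test seeds (123, 456, 789, 1011) are taken from the paper's quote; the environment id, checkpoint path, and the number of episodes per seeded test environment are assumptions.

```python
# Minimal sketch of the described final evaluation, assuming Gymnasium
# seeding semantics. Test seeds are quoted from the paper; 8 episodes per
# seeded environment mirrors the mid-training protocol and is an assumption
# here, as are the env id and checkpoint path.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

model = PPO.load("ppo_pong_checkpoint")      # best-performing checkpoint (assumed path)
test_seeds = [123, 456, 789, 1011]
returns = []

for seed in test_seeds:
    env = gym.make("ALE/Pong-v5")
    obs, _ = env.reset(seed=seed)            # one seeded test environment per seed
    for _ in range(8):                       # episodes per environment (assumed)
        done, ep_return = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(int(action))
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
        obs, _ = env.reset()                 # next episode in the same seeded env
    env.close()

print("mean test return:", np.mean(returns))
```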