Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Authors: Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel, Wolfgang Stammer, Kristian Kersting

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results provide evidence of SCoBots' competitive performances, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game, Pong, and resolve it.
Researcher Affiliation | Academia | Quentin Delfosse (1), Sebastian Sztwiertnia (1), Mark Rothermel (1), Wolfgang Stammer (1,2), Kristian Kersting (1,2,3,4); 1 Computer Science Department, TU Darmstadt, Germany; 2 Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany; 3 Centre for Cognitive Science, TU Darmstadt, Germany; 4 German Research Center for Artificial Intelligence (DFKI), Darmstadt, Germany; {firstname.lastname}@cs.tu-darmstadt.de
Pseudocode | No | The paper describes the architecture and processes verbally and with diagrams (e.g., Figure 2), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code available at https://github.com/k4ntz/SCoBots
Open Datasets | Yes | We evaluate SCoBots on 9 Atari games (cf. Fig. 3) from the Atari Learning Environments [Bellemare et al., 2012] (by far the most used RL framework, cf. App. A.1), as well as the HackAtari modified [Delfosse et al., 2024a] Pong environments.
Dataset Splits | Yes | Each training seed's performance is evaluated every 500k frames on 4 differently seeded (42 + training seed) environments for 8 episodes each. After training, the best performing checkpoint is then ultimately evaluated on 4 seeded (123, 456, 789, 1011) test environments. (A hedged sketch of this evaluation protocol follows the table.)
Hardware Specification | Yes | All experiments were run on an AMD Ryzen 7 processor, 64GB of RAM and one NVIDIA GeForce RTX 2080 Ti GPU.
Software Dependencies | No | All agents are trained for 20M frames under the Proximal Policy Optimization algorithm (PPO, [Schulman et al., 2017]), specifically the stable-baseline3 implementation [Raffin et al., 2021] and its default hyperparameters (cf. Tab. 2 in App. A.5). While the library (i.e., stable-baselines3) is named, a specific version number for the package itself is not provided.
Experiment Setup | Yes | All agents are trained for 20M frames under the Proximal Policy Optimization algorithm (PPO, [Schulman et al., 2017]), specifically the stable-baseline3 implementation [Raffin et al., 2021] and its default hyperparameters (cf. Tab. 2 in App. A.5).
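
The training setup quoted in the Experiment Setup row can be approximated in a few lines. This is a minimal sketch assuming the standard stable-baselines3 PPO API and a Gymnasium/ALE Atari environment; it omits the authors' object-centric SCoBots wrappers, and the environment id, policy choice, and checkpoint name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the reported training setup, assuming the standard
# stable-baselines3 PPO API and a Gymnasium/ALE Atari environment
# (may require: pip install "gymnasium[atari,accept-rom-license]" stable-baselines3).
# The authors' agents act on object-centric concept vectors via their
# SCoBots wrappers (not shown here); this only illustrates the PPO side.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("ALE/Pong-v5")               # one of the 9 evaluated Atari games (assumed id)
model = PPO("CnnPolicy", env, verbose=1)    # default hyperparameters, as reported
model.learn(total_timesteps=20_000_000)     # "trained for 20M frames"
model.save("ppo_pong_checkpoint")           # checkpoint name is illustrative
```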
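
Similarly, the seeded evaluation protocol quoted in the Dataset Splits row can be sketched as below. The test seeds (123, 456, 789, 1011) are taken from the paper's quote; the environment id, checkpoint path, and the number of episodes per seeded test environment are assumptions.

```python
# Minimal sketch of the described final evaluation, assuming Gymnasium
# seeding semantics. Test seeds are quoted from the paper; 8 episodes per
# seeded environment mirrors the mid-training protocol and is an assumption
# here, as are the env id and checkpoint path.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

model = PPO.load("ppo_pong_checkpoint")      # best-performing checkpoint (assumed path)
test_seeds = [123, 456, 789, 1011]
returns = []

for seed in test_seeds:
    env = gym.make("ALE/Pong-v5")
    obs, _ = env.reset(seed=seed)            # one seeded test environment per seed
    for _ in range(8):                       # episodes per environment (assumed)
        done, ep_return = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(int(action))
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
        obs, _ = env.reset()                 # next episode in the same seeded env
    env.close()

print("mean test return:", np.mean(returns))
```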