Training Characteristic Functions with Reinforcement Learning: XAI-methods play Connect Four
Authors: Stephan Wäldchen, Sebastian Pokutta, Felix Huber
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this to the game of Connect Four by randomly hiding colour information from our agents during training. (See the colour-masking sketch below the table.) |
| Researcher Affiliation | Academia | TU Berlin & Zuse Institut Berlin. |
| Pseudocode | No | The paper refers to 'Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017)' but does not provide its own pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described in the paper is openly available. |
| Open Datasets | Yes | We apply this to the game of Connect Four... We make use of the fact that neural networks have emerged as one of the strongest models for reinforcement learning, and e.g., constitute the first human competitive models for Go (Silver et al., 2017) and Atari games (Mnih et al., 2015). Our exact setup is illustrated in Figure 1. ...Our training setup is based on Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017). This setup was applied to Connect Four as described in (Crespo, 2019)... Additionally, we used a game played by two perfect solvers (taken from Pons, 2019) and measured how many of the 41 moves were predicted correctly by our agents. For the results see Table 1. ...an MCTS-agent taken from (Vogt, 2019). (See the move-agreement sketch below the table.) |
| Dataset Splits | No | The paper describes a reinforcement learning setup where agents train through self-play and are then benchmarked. It does not mention explicit training/validation/test dataset splits in the traditional supervised learning sense for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper describes the network architecture used but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions using 'Adam on torch standard settings' and references the 'iNNvestigate' and 'SHAP' toolboxes but does not specify version numbers for PyTorch or the toolboxes. |
| Experiment Setup | Yes | Our training setup is based on Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017). This setup was applied to Connect Four as described in (Crespo, 2019), and we adopt most of the hyper-parameters for the training of our agents. ...Network Architecture: We use a modified version of the architecture proposed by Crespo with two additional fully connected (FC) layers, described in Figure 3. ...Training Parameters: Our PPO-agent plays against itself and for every turn saves state, value output, policy output, reward and an indicator for the last move of the game. We give a reward of 1 for wins, 0 for draws, -1 for losses and -2 for illegal moves. ...We make use of a discount factor γ = 0.75 to propagate back reward to obtain discounted rewards for each state. For clipping the policy loss, we set ϵ = 0.2. The total loss weighs the policy loss with 1.0, the value loss with 0.5 and the entropy loss with 0.01. Every 10 games we update the network parameters with Adam on torch standard settings and a learning rate of l = 0.0001 for 4 steps. (See the PPO loss sketch below the table.) |
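
The colour-hiding mechanism quoted in the Research Type row can be illustrated in a few lines. The sketch below is a minimal reconstruction, not the authors' code: it assumes a 6×7 board encoded with +1/-1 for the two players and 0 for empty cells, and the hidden-colour token and per-cell hiding probability are placeholder choices.

```python
import numpy as np

HIDDEN = 2  # placeholder token: "a piece sits here, but its colour is hidden"


def hide_colours(board, p_hide, rng=None):
    """Return a copy of `board` where each occupied cell loses its colour
    with probability `p_hide`."""
    rng = rng or np.random.default_rng()
    masked = board.copy()
    occupied = board != 0
    hide = occupied & (rng.random(board.shape) < p_hide)
    masked[hide] = HIDDEN
    return masked


board = np.zeros((6, 7), dtype=np.int8)   # 6 rows x 7 columns, 0 = empty
board[5, 3], board[5, 4] = 1, -1          # two pieces on the bottom row
print(hide_colours(board, p_hide=0.5))
```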
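
The benchmark quoted in the Open Datasets row, counting how many of the 41 moves from a perfect-solver game the trained agent reproduces, amounts to a simple agreement score. The sketch below is an assumed reconstruction; `agent_move` and the move encoding (column indices) are illustrative placeholders rather than the authors' evaluation code.

```python
from typing import Callable, List, Sequence


def move_agreement(agent_move: Callable[[List[int]], int],
                   solver_moves: Sequence[int]) -> int:
    """Replay the solver game and count how many next moves the agent predicts."""
    history: List[int] = []     # columns played so far
    correct = 0
    for solver_move in solver_moves:
        if agent_move(history) == solver_move:   # agent's prediction for this position
            correct += 1
        history.append(solver_move)              # always follow the solver game
    return correct  # out of len(solver_moves); 41 in the benchmark quoted above
```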
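
The hyper-parameters quoted in the Experiment Setup row can be collected into a short PPO loss sketch. This is a minimal reconstruction under the stated settings (γ = 0.75, clip ε = 0.2, loss weights 1.0/0.5/0.01); the tensor names and helper structure are assumptions, not the authors' implementation.

```python
import torch

GAMMA, CLIP_EPS = 0.75, 0.2
W_POLICY, W_VALUE, W_ENTROPY = 1.0, 0.5, 0.01


def discounted_returns(rewards, dones, gamma=GAMMA):
    """Propagate each game's reward back through its states (reset at game ends)."""
    returns, running = [], 0.0
    for r, done in zip(reversed(rewards), reversed(dones)):
        running = r if done else r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)


def ppo_loss(new_log_probs, old_log_probs, values, returns, entropy):
    """Clipped surrogate policy loss + value loss - entropy bonus, weighted as quoted."""
    advantages = returns - values.detach()
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return W_POLICY * policy_loss + W_VALUE * value_loss - W_ENTROPY * entropy.mean()
```

With this loss, the update schedule in the row above would correspond to `torch.optim.Adam(model.parameters(), lr=0.0001)` applied every 10 self-play games for 4 gradient steps.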