Training Characteristic Functions with Reinforcement Learning: XAI-methods play Connect Four
Authors: Stephan Wäldchen, Sebastian Pokutta, Felix Huber
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this to the game of Connect Four by randomly hiding colour information from our agents during training. (See the colour-masking sketch below the table.) |
| Researcher Affiliation | Academia | TU Berlin & Zuse Institut Berlin. |
| Pseudocode | No | The paper refers to 'Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017)' but does not provide its own pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described in the paper is openly available. |
| Open Datasets | Yes | We apply this to the game of Connect Four... We make use of the fact that neural networks have emerged as one of the strongest models for reinforcement learning, and e.g., constitute the first human competitive models for Go (Silver et al., 2017) and Atari games (Mnih et al., 2015). Our exact setup is illustrated in Figure 1. ...Our training setup is based on Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017). This setup was applied to Connect Four as described in (Crespo, 2019)... Additionally, we used a game played by two perfect solvers (taken from Pons, 2019) and measured how many of the 41 moves were predicted correctly by our agents. For the results see Table 1. ...an MCTS-agent taken from (Vogt, 2019). (See the move-agreement sketch below the table.) |
| Dataset Splits | No | The paper describes a reinforcement learning setup where agents train through self-play and are then benchmarked. It does not mention explicit training/validation/test dataset splits in the traditional supervised learning sense for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper describes the network architecture used but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions using 'Adam on torch standard settings' and references the 'iNNvestigate' and 'SHAP' toolboxes but does not specify version numbers for PyTorch or the toolboxes. |
| Experiment Setup | Yes | Our training setup is based on Algorithm 1 (PPO, Actor-Critic Style) in (Schulman et al., 2017). This setup was applied to Connect Four as described in (Crespo, 2019), and we adopt most of the hyper-parameters for the training of our agents. ...Network Architecture: We use a modified version of the architecture proposed by Crespo with two additional fully connected (FC) layers, described in Figure 3. ...Training Parameters: Our PPO-agent plays against itself and for every turn saves state, value output, policy output, reward and an indicator for the last move of the game. We give a reward of 1 for wins, 0 for draws, -1 for losses and -2 for illegal moves. ...We make use of a discount factor γ = 0.75 to propagate back reward to obtain discounted rewards for each state. For clipping the policy loss, we set ϵ = 0.2. The total loss weighs the policy loss with 1.0, the value loss with 0.5 and the entropy loss with 0.01. Every 10 games we update the network parameters with Adam on torch standard settings and a learning rate of l = 0.0001 for 4 steps. (See the PPO loss sketch below the table.) |
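
The colour-hiding mechanism quoted in the Research Type row can be illustrated in a few lines. The sketch below is a minimal reconstruction, not the authors' code: it assumes a 6×7 board encoded with +1/-1 for the two players and 0 for empty cells, and the hidden-colour token and per-cell hiding probability are placeholder choices.

```python
import numpy as np

HIDDEN = 2  # placeholder token: "a piece sits here, but its colour is hidden"


def hide_colours(board, p_hide, rng=None):
    """Return a copy of `board` where each occupied cell loses its colour
    with probability `p_hide`."""
    rng = rng or np.random.default_rng()
    masked = board.copy()
    occupied = board != 0
    hide = occupied & (rng.random(board.shape) < p_hide)
    masked[hide] = HIDDEN
    return masked


board = np.zeros((6, 7), dtype=np.int8)   # 6 rows x 7 columns, 0 = empty
board[5, 3], board[5, 4] = 1, -1          # two pieces on the bottom row
print(hide_colours(board, p_hide=0.5))
```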
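
The benchmark quoted in the Open Datasets row, counting how many of the 41 moves from a perfect-solver game the trained agent reproduces, amounts to a simple agreement score. The sketch below is an assumed reconstruction; `agent_move` and the move encoding (column indices) are illustrative placeholders rather than the authors' evaluation code.

```python
from typing import Callable, List, Sequence


def move_agreement(agent_move: Callable[[List[int]], int],
                   solver_moves: Sequence[int]) -> int:
    """Replay the solver game and count how many next moves the agent predicts."""
    history: List[int] = []     # columns played so far
    correct = 0
    for solver_move in solver_moves:
        if agent_move(history) == solver_move:   # agent's prediction for this position
            correct += 1
        history.append(solver_move)              # always follow the solver game
    return correct  # out of len(solver_moves); 41 in the benchmark quoted above
```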
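
The hyper-parameters quoted in the Experiment Setup row can be collected into a short PPO loss sketch. This is a minimal reconstruction under the stated settings (γ = 0.75, clip ε = 0.2, loss weights 1.0/0.5/0.01); the tensor names and helper structure are assumptions, not the authors' implementation.

```python
import torch

GAMMA, CLIP_EPS = 0.75, 0.2
W_POLICY, W_VALUE, W_ENTROPY = 1.0, 0.5, 0.01


def discounted_returns(rewards, dones, gamma=GAMMA):
    """Propagate each game's reward back through its states (reset at game ends)."""
    returns, running = [], 0.0
    for r, done in zip(reversed(rewards), reversed(dones)):
        running = r if done else r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)


def ppo_loss(new_log_probs, old_log_probs, values, returns, entropy):
    """Clipped surrogate policy loss + value loss - entropy bonus, weighted as quoted."""
    advantages = returns - values.detach()
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return W_POLICY * policy_loss + W_VALUE * value_loss - W_ENTROPY * entropy.mean()
```

With this loss, the update schedule in the row above would correspond to `torch.optim.Adam(model.parameters(), lr=0.0001)` applied every 10 self-play games for 4 gradient steps.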