Reinforcement Learning of Causal Variables Using Mediation Analysis

Authors: Tue Herlau, Rasmus Larsen

AAAI 2022, pp. 6910-6917 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We test the value function recursions in eq. (8) on a simple Markov reward process dubbed TWOSTAGE, corresponding to an idealized version of the DOORKEY environment. In TWOSTAGE, the states are divided into two sets S_A and S_B. The initial state is always in S_A, and the environment can either transition within a set (S_A → S_A, S_B → S_B) with a fixed probability, or from S_A to S_B with a fixed probability. From S_B, there is a chance to terminate successfully with a reward of +1, and from all states there is a chance to terminate unsuccessfully with a reward of 0. The transition from states in S_A to S_B creates a bottleneck distinguishing successful from unsuccessful episodes, much like unlocking the door in the DOORKEY environment. The transition probabilities are chosen such that p(R = 1 | s ∈ S_B) = p(s ∈ S_B | s ∈ S_A) = 2/3 and p(R = 1 | s ∈ S_A) = 4/9; see the appendix for further details.

To apply algorithm 1 to the DOORKEY environment, we first have to parameterize the states. The environment has |A| = 5 actions, and we consider a fully observed variant of the environment. We choose the simplest possible encoding, in which each tile, depending on its state, is one-hot encoded as an 11-dimensional vector. This means that an n × n environment is encoded as an n × n × 11-dimensional sparse tensor, and we include a single one-hot encoded feature to account for the player orientation. Further details can be found in the appendix. Episode length is 60 steps. Since the environment encodes orientation, player position and goal position separately, and since specific actions must be used when picking up the key and opening the door, the environment is surprisingly difficult to explore and generalize in. We train an agent using A2C (Mnih et al. 2016) with 1-hidden-layer fully connected neural networks, which results in a completion rate of about 0.25 within the episode limit. We also attempted to train an agent using the Option-Critic framework (Bacon, Harb, and Precup 2017), a relevant comparison to our method, but failed to learn options which solved the environment better than chance.

After an initial training of π_a, we train Φ and π_b by maximizing the NIE, using algorithm 1. Training parameters can be found in the supplementary material. To obtain a fair evaluation on separate test data, we simulate the method on 200 random instances of the DOORKEY environment, and use Monte-Carlo roll-outs of the policies π_a and π_b to estimate the quantities E[Z = 1 | Π = a] and E[Z = 1 | Π = b]. This allows us to estimate the NIE on a separate test set. To examine whether the obtained definition of Z is nontrivial, we compare it against a natural alternative that learns Z by maximizing the cross-entropy of Z and Y,

E_τ[Y(τ) log P(Z = z | τ)]   (17)

Since Y is binary, this corresponds to determining Φ as the binary classifier which separates successful (Y = 1) episodes from unsuccessful (Y = 0) episodes, i.e. it ensures that the first factor of the NIE, eq. (16), is large. The results of both methods can be found in table 1 (results averaged over 10 restarts with different seeds).
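The TWOSTAGE process can be reproduced with a few lines of simulation code. The sketch below is only an illustration, not the authors' implementation: the per-step probabilities (0.2 and 0.1) are hypothetical choices that satisfy the stated constraints p(R = 1 | s ∈ S_B) = p(s ∈ S_B | s ∈ S_A) = 2/3, and hence p(R = 1 | s ∈ S_A) = 4/9; the paper's exact values are given in its appendix.

```python
import random

# Hypothetical per-step probabilities (assumption, not the paper's values).
# From S_A the chain either moves to S_B (0.2), terminates unsuccessfully
# (0.1), or stays in S_A; the absorption ratio 0.2 / (0.2 + 0.1) = 2/3
# matches p(s in S_B | s in S_A). The same ratio governs successful
# termination from S_B, so p(R=1 | S_B) = 2/3 and p(R=1 | S_A) = 4/9.
P_A_TO_B, P_A_FAIL = 0.2, 0.1
P_B_SUCCESS, P_B_FAIL = 0.2, 0.1


def rollout(start="A"):
    """Simulate one episode of the idealized TWOSTAGE process; return the final reward."""
    state = start
    while True:
        u = random.random()
        if state == "A":
            if u < P_A_TO_B:
                state = "B"
            elif u < P_A_TO_B + P_A_FAIL:
                return 0.0          # unsuccessful termination
        else:  # state == "B"
            if u < P_B_SUCCESS:
                return 1.0          # successful termination, reward +1
            elif u < P_B_SUCCESS + P_B_FAIL:
                return 0.0          # unsuccessful termination


n = 100_000
print(sum(rollout("B") for _ in range(n)) / n)  # ~2/3
print(sum(rollout("A") for _ in range(n)) / n)  # ~4/9
```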
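The state parameterization can be sketched in the same spirit. This is an assumption-laden illustration: the exact set of 11 per-tile categories and the orientation encoding are specified in the paper's appendix, so the tile ids and the 4-way orientation vector below are placeholders, and a real encoder would map the DOORKEY grid contents to these indices.

```python
import numpy as np

# Assumed dimensions: 11 per-tile features (as stated in the excerpt) and a
# hypothetical 4-way orientation one-hot; the paper's appendix has the details.
N_TILE_FEATURES = 11
N_ORIENTATIONS = 4


def encode_state(tile_ids: np.ndarray, orientation: int):
    """One-hot encode an n x n grid of tile ids (each in [0, 11)) plus the agent orientation.

    Returns an n x n x 11 tensor (sparse in the sense that each tile has a single
    nonzero entry) and a 4-dimensional orientation one-hot vector.
    """
    n = tile_ids.shape[0]
    grid = np.zeros((n, n, N_TILE_FEATURES), dtype=np.float32)
    rows, cols = np.indices((n, n))
    grid[rows, cols, tile_ids] = 1.0

    orient = np.zeros(N_ORIENTATIONS, dtype=np.float32)
    orient[orientation] = 1.0
    return grid, orient


# Example: a random 8 x 8 grid of tile ids, agent facing direction 2.
grid, orient = encode_state(np.random.randint(0, N_TILE_FEATURES, size=(8, 8)), 2)
print(grid.shape, orient.shape)  # (8, 8, 11) (4,)
```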
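The held-out evaluation and the cross-entropy baseline of eq. (17) can likewise be sketched. The snippet below assumes rollouts that already record, per episode, whether the learned variable fired (Z = 1) and whether the episode succeeded (Y = 1); the helpers `rollout_policy` and `sample_envs` are hypothetical, and only the two conditional expectations quoted above are computed, since the full NIE expression of eq. (16) is not reproduced in this excerpt.

```python
import numpy as np


def estimate_z_rate(episodes):
    """Monte-Carlo estimate of E[Z = 1 | Pi] from a list of (z, y) episode outcomes."""
    return float(np.mean([z for z, _ in episodes]))


def cross_entropy_objective(p_z_given_tau, y):
    """Eq. (17) as quoted: E_tau[ Y(tau) * log P(Z = 1 | tau) ], which rewards Phi
    for firing on successful (Y = 1) episodes. A practical classifier would add the
    Y = 0 term, i.e. use the full binary cross-entropy."""
    p = np.clip(p_z_given_tau, 1e-8, 1.0)
    return float(np.mean(y * np.log(p)))


# Hypothetical usage: roll out both policies on 200 random DOORKEY test instances.
# episodes_a = [rollout_policy(pi_a, env) for env in sample_envs(200)]
# episodes_b = [rollout_policy(pi_b, env) for env in sample_envs(200)]
# ez_a = estimate_z_rate(episodes_a)   # E[Z = 1 | Pi = a]
# ez_b = estimate_z_rate(episodes_b)   # E[Z = 1 | Pi = b]
```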
Researcher Affiliation Collaboration Tue Herlau (Technical University of Denmark, 2800 Lyngby, Denmark; tuhe@dtu.dk) and Rasmus Larsen (Alexandra Institute, 2300 Copenhagen, Denmark; ralars@dtu.dk)
Pseudocode Yes Algorithm 1: Causal learner
Open Source Code Yes Code: https://gitlab.compute.dtu.dk/tuhe/causalnie
Open Datasets No The paper describes using simulated environments (the TWOSTAGE process and the DOORKEY environment) for its experiments rather than a publicly available, pre-existing dataset with a provided link or citation for access.
Dataset Splits No The paper does not provide specific details about training, validation, and test splits for a dataset. It describes simulating environments and then evaluating on "separate test data" from "200 random instances" but does not specify a validation split.
Hardware Specification No The paper does not provide specific details about the hardware used for running experiments (e.g., CPU/GPU models, memory specifications).
Software Dependencies No The paper mentions frameworks and methods like "A2C" and "Option Critic framework" but does not specify software versions (e.g., Python 3.x, PyTorch 1.x) for any of its dependencies, which are necessary for full reproducibility.
Experiment Setup No The paper mentions general aspects like "1-hidden-layer fully connected neural networks" and "Episode length is 60 steps." It also states "Training parameters can be found in the supplementary material," but does not include specific hyperparameters (e.g., learning rate, batch size, optimizer details) in the main text.