Reinforcement Learning of Causal Variables Using Mediation Analysis
Authors: Tue Herlau, Rasmus Larsen (pp. 6910-6917)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the value function recursions in eq. (8) on a simple Markov reward process dubbed TWOSTAGE, corresponding to an idealized version of the DOORKEY environment. In TWOSTAGE, the states are divided into two sets S_A and S_B. The initial state is always in S_A, and the environment can either transition within a set (S_A → S_A, S_B → S_B) with a fixed probability, or from set S_A to S_B with a fixed probability. From S_B, there is a chance to terminate successfully with a reward of +1, and from all states there is a chance to terminate unsuccessfully with a reward of 0. The transition from states in S_A to S_B creates a bottleneck distinguishing successful and unsuccessful episodes, much like unlocking the door in the DOORKEY environment. The transition probabilities are chosen such that p(R = 1 | s ∈ S_B) = p(s ∈ S_B | s ∈ S_A) = 2/3 and p(R = 1 | s ∈ S_A) = 4/9; see the appendix for further details. [A simulation sketch sanity-checking these probabilities follows the table.] To apply algorithm 1 to the DOORKEY environment, we first have to parameterize the states. The environment has |A| = 5 actions, and we consider a fully-observed variant of the environment. We choose the simplest possible encoding, in which each tile, depending on its state, is one-hot encoded as an 11-dimensional vector. This means that an n × n environment is encoded as an n × n × 11-dimensional sparse tensor, and we include a single one-hot encoded feature to account for the player orientation. Further details can be found in the appendix. Episode length is 60 steps. Since the environment encodes orientation, player position and goal position separately, and since specific actions must be used when picking up the key and opening the door, the environment is surprisingly difficult to explore and generalize in. We train an agent using A2C (Mnih et al. 2016) with 1-hidden-layer fully connected neural networks, which results in a completion rate of about 0.25 within the episode limit. We also attempted to train an agent using the Option-Critic framework (Bacon, Harb, and Precup 2017), a relevant comparison to our method, but failed to learn options which solved the environment better than chance. After an initial training of π_a, we train Φ and π_b by maximizing the NIE, using algorithm 1. Training parameters can be found in the supplementary material. To obtain a fair evaluation on separate test data, we simulate the method on 200 random instances of the DOORKEY environment, and use Monte-Carlo roll-outs of the policies π_a and π_b to estimate the quantities E[Z = 1 | Π = a] and E[Z = 1 | Π = b]. This allows us to estimate the NIE on a separate test set. To examine whether the obtained definition of Z is nontrivial, we compare it against a natural alternative that learns Z by maximizing the cross-entropy of Z and Y, E_τ[Y(τ) log P(Z = z | τ)] (eq. 17). Since Y is binary, this corresponds to determining Φ as the binary classifier which separates successful (Y = 1) episodes from unsuccessful (Y = 0) episodes, i.e. it ensures that the first factor of the NIE, eq. (16), is large. The results of both methods can be found in table 1 (results averaged over 10 restarts with different seeds). |
| Researcher Affiliation | Collaboration | Tue Herlau (1), Rasmus Larsen (2); 1: Technical University of Denmark, 2800 Lyngby, Denmark; 2: Alexandra Institute, 2300 Copenhagen, Denmark; tuhe@dtu.dk, ralars@dtu.dk |
| Pseudocode | Yes | Algorithm 1: Causal learner |
| Open Source Code | Yes | Code: https://gitlab.compute.dtu.dk/tuhe/causalnie |
| Open Datasets | No | The paper describes using simulated environments (the TWOSTAGE and DOORKEY environments) for experiments rather than a publicly available, pre-existing dataset with a provided link or citation for access. |
| Dataset Splits | No | The paper does not provide specific details about training, validation, and test splits for a dataset. It describes simulating environments and then evaluating on "separate test data" from "200 random instances" but does not specify a validation split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments (e.g., CPU/GPU models, memory specifications). |
| Software Dependencies | No | The paper mentions frameworks and methods like "A2C" and "Option Critic framework" but does not specify software versions (e.g., Python 3.x, PyTorch 1.x) for any of its dependencies, which are necessary for full reproducibility. |
| Experiment Setup | No | The paper mentions general aspects like "1-hidden-layer fully connected neural networks" and "Episode length is 60 steps." It also states "Training parameters can be found in the supplementary material," but does not include specific hyperparameters (e.g., learning rate, batch size, optimizer details) in the main text. |
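
As a quick sanity check of the TWOSTAGE probabilities quoted in the Research Type row, the sketch below simulates an idealized two-set Markov reward process. The per-step probabilities (`P_A_TO_B`, `P_A_FAIL`, `P_B_SUCCESS`, `P_B_FAIL`) are illustrative assumptions, not values from the paper's appendix; they are merely chosen so that p(s ∈ S_B | s ∈ S_A) = p(R = 1 | s ∈ S_B) = 2/3, from which p(R = 1 | s ∈ S_A) = 2/3 · 2/3 = 4/9 follows.

```python
# Minimal Monte-Carlo sanity check of the TWOSTAGE Markov reward process described above.
# The per-step probabilities are illustrative assumptions (the paper's appendix gives the
# actual values); they are chosen so that the quoted conditional probabilities hold:
#   p(reach S_B | start in S_A) = 2/3,  p(R = 1 | s in S_B) = 2/3,
#   hence p(R = 1 | s in S_A) = 2/3 * 2/3 = 4/9.
import random

# Per-step event probabilities (assumed, not from the paper).
P_A_TO_B = 0.2      # transition S_A -> S_B
P_A_FAIL = 0.1      # terminate unsuccessfully from S_A (reward 0)
P_B_SUCCESS = 0.2   # terminate successfully from S_B (reward +1)
P_B_FAIL = 0.1      # terminate unsuccessfully from S_B (reward 0)


def run_episode(rng: random.Random) -> int:
    """Simulate one TWOSTAGE episode; return the terminal reward (0 or 1)."""
    state = "A"  # the initial state is always in S_A
    while True:
        u = rng.random()
        if state == "A":
            if u < P_A_TO_B:
                state = "B"          # cross the bottleneck S_A -> S_B
            elif u < P_A_TO_B + P_A_FAIL:
                return 0             # unsuccessful termination
            # otherwise stay within S_A
        else:  # state == "B"
            if u < P_B_SUCCESS:
                return 1             # successful termination, reward +1
            elif u < P_B_SUCCESS + P_B_FAIL:
                return 0             # unsuccessful termination
            # otherwise stay within S_B


if __name__ == "__main__":
    rng = random.Random(0)
    n = 100_000
    rewards = [run_episode(rng) for _ in range(n)]
    print(f"Estimated p(R = 1 | start in S_A): {sum(rewards) / n:.3f} (expected 4/9 ≈ 0.444)")
```

Running the script prints an estimate close to 0.444, matching the quoted value of 4/9 for episodes starting in S_A.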