Explainable Reinforcement Learning via a Causal World Model
Authors: Zhongwei Yu, Jingqing Ruan, Dengpeng Xing
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present examples of causal chains in two representative environments: Lunarlander-Continuous for the continuous action space, and Build-Marine for the discrete action space. To verify whether our approach can produce correct causal chains, we design an environment to measure the accuracy of recovering causal dependencies of the ground-truth AIM. To evaluate the performance of our model in MBRL, we perform experiments in two extra environments: Cartpole and Lunarlander-Discrete. |
| Researcher Affiliation | Academia | Institute of Automation, Chinese Academy of Sciences {yuzhongwei2021, ruanjingqing2019, dengpeng.xing}@ia.ac.cn |
| Pseudocode | Yes | The pseudo-code of the learning procedure is given in Appendix D. |
| Open Source Code | Yes | Our source code is available at https://github.com/EaseOnway/Explainable-Causal-Reinforcement-Learning. |
| Open Datasets | Yes | The Build-Marine environment is adapted from one of the StarCraft II mini-games in SC2LE [Samvelyan et al., 2019]; the Cartpole and Lunarlander environments are classic control problems provided by OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | No | The paper describes collecting transition data into a buffer for training and updating the model but does not specify exact train/validation/test dataset splits or percentages. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "Proximal Policy Optimization" and refers to a model that is "learned using PyTorch" in Appendix E.2, but it does not specify version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We first use the policy (with noise) to collect 150k samples into the buffer D. Then, we use these samples to discover the causal graph (with the threshold η = 0.05) and train the inference networks. |
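
The experiment-setup excerpt above (collect 150k transitions with a noisy policy into a buffer, discover the causal graph by thresholding at η = 0.05, then train the inference networks) can be illustrated with the minimal Python sketch below. It is not the authors' implementation: the environment/policy interfaces follow the classic OpenAI Gym API, and `score_fn`, the per-variable networks, and their `fit` method are hypothetical placeholders standing in for the paper's causal-discovery score and inference networks.

```python
import numpy as np

ETA = 0.05             # edge-selection threshold reported in the setup
NUM_SAMPLES = 150_000  # transitions collected before discovery/training


def collect_transitions(env, policy, n_samples=NUM_SAMPLES):
    """Roll out the exploration (noisy) policy and store transitions.

    Assumes the classic Gym interface: reset() -> obs,
    step(a) -> (obs, reward, done, info).
    """
    buffer = []
    state = env.reset()
    for _ in range(n_samples):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return buffer


def discover_causal_graph(buffer, n_inputs, n_outputs, score_fn, eta=ETA):
    """Build a boolean parent matrix by thresholding a dependence score.

    score_fn(buffer, i, j) is a hypothetical stand-in for whatever
    conditional-dependence measure is estimated; edge i -> j is kept
    only if its score exceeds eta.
    """
    graph = np.zeros((n_inputs, n_outputs), dtype=bool)
    for i in range(n_inputs):
        for j in range(n_outputs):
            graph[i, j] = score_fn(buffer, i, j) > eta
    return graph


def train_inference_networks(buffer, graph, networks, epochs=10):
    """Fit one inference network per output variable, masking its inputs
    to the parents selected in `graph` (hypothetical fit API)."""
    for _ in range(epochs):
        for j, net in enumerate(networks):
            net.fit(buffer, input_mask=graph[:, j])
```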