Explainable Reinforcement Learning via a Causal World Model

Authors: Zhongwei Yu, Jingqing Ruan, Dengpeng Xing

Venue: IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present examples of causal chains in two representative environments: Lunarlander-Continuous for the continuous action space, and Build-Marine for the discrete action space. To verify whether our approach can produce correct causal chains, we design an environment to measure the accuracy of recovering causal dependencies of the ground-truth AIM. To evaluate the performance of our model in MBRL, we perform experiments in two extra environments: Cartpole and Lunarlander-Discrete.
Researcher Affiliation | Academia | Institute of Automation, Chinese Academy of Sciences; {yuzhongwei2021, ruanjingqing2019, dengpeng.xing}@ia.ac.cn
Pseudocode | Yes | The pseudo-code of the learning procedure is given in Appendix D.
Open Source Code | Yes | Our source code is available at https://github.com/EaseOnway/Explainable-Causal-Reinforcement-Learning.
Open Datasets | Yes | The Build-Marine environment is adapted from one of the StarCraft II mini-games in SC2LE [Samvelyan et al., 2019]; the Cartpole and Lunarlander environments are classic control problems provided by OpenAI Gym [Brockman et al., 2016].
Dataset Splits | No | The paper describes collecting transition data into a buffer for training and updating the model but does not specify exact train/validation/test dataset splits or percentages.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using "Proximal Policy Optimization" and refers to a model that is "learned using PyTorch" in Appendix E.2, but it does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We first use the policy (with noise) to collect 150k samples into the buffer D. Then, we use these samples to discover the causal graph (with the threshold η = 0.05) and train the inference networks. (See the sketch following this table.)
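
The quoted setup (fill a buffer with transitions from a noisy policy, discover a causal graph by thresholding at η = 0.05, then train inference networks) can be pictured with a minimal sketch. Everything below is an illustration under stated assumptions, not the authors' pipeline: the toy environment, the correlation-based independence test, and all identifiers (ToyEnv, dependence_pvalue) are hypothetical stand-ins, and the buffer is shrunk from 150k to 1k samples so the sketch runs quickly. The real implementation is in the repository linked above.

```python
# Hedged sketch of: buffer collection -> thresholded causal discovery ->
# per-variable inference networks. All names here are hypothetical.
import math
import numpy as np
import torch
import torch.nn as nn

ETA = 0.05         # edge-selection threshold, as reported in the paper
N_SAMPLES = 1_000  # the paper collects 150k; kept small for this sketch

class ToyEnv:
    """Toy dynamics: only s[0] influences the next s[1] (hypothetical)."""
    def reset(self):
        self.s = np.random.randn(2)
        return self.s
    def step(self, a):  # action ignored in this toy
        nxt = np.array([np.random.randn(),
                        0.8 * self.s[0] + 0.1 * np.random.randn()])
        self.s = nxt
        return nxt

def dependence_pvalue(x, y):
    """Stand-in test: two-sided p-value for zero Pearson correlation
    via the Fisher z-transform (a simplification, not the paper's test)."""
    r = np.corrcoef(x, y)[0, 1]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(len(x) - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 1) Collect transitions with a noise-only stand-in policy into buffer D.
env, buffer = ToyEnv(), []
s = env.reset()
for _ in range(N_SAMPLES):
    a = np.random.randn(1)
    s_next = env.step(a)
    buffer.append((s.copy(), s_next.copy()))
    s = s_next
S = np.array([b[0] for b in buffer])
S_next = np.array([b[1] for b in buffer])

# 2) Discover the causal graph: keep edge i -> j if the test rejects
#    independence at threshold eta (the paper's eta = 0.05).
edges = [(i, j) for i in range(2) for j in range(2)
         if dependence_pvalue(S[:, i], S_next[:, j]) < ETA]
print("discovered edges:", edges)  # expect [(0, 1)] on most runs

# 3) Train one small inference network per next-state variable,
#    conditioned only on its discovered parents.
for j in range(2):
    parents = [i for i, jj in edges if jj == j]
    if not parents:
        continue
    net = nn.Sequential(nn.Linear(len(parents), 16), nn.ReLU(),
                        nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    x = torch.tensor(S[:, parents], dtype=torch.float32)
    y = torch.tensor(S_next[:, [j]], dtype=torch.float32)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    print(f"inference net for s'[{j}]: final MSE {loss.item():.4f}")
```

Only the thresholding pattern carries over from the quoted setup: an edge survives when the test rejects independence at η, and each next-state variable then gets its own predictor over its discovered parents.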