Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning

Authors: Wenhao Ding, Haohong Lin, Bo Li, Ding Zhao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of the proposed method, we conduct comprehensive experiments in environments that require strong reasoning capability. Specifically, we design two types of generalization settings, i.e., spuriousness and composition, and provide an example to illustrate these settings in Figure 1. The evaluation results confirm the advantages of our method in two aspects. First, the proposed data-efficient discovery method provides an explainable causal graph yet requires much fewer data than previous methods, increasing data efficiency and interpretability during task solving. Second, simultaneously discovering the causal graph during policy learning dramatically increases the success rate of solving tasks.
Researcher Affiliation | Academia | 1 Carnegie Mellon University, 2 University of Illinois Urbana-Champaign; {wenhaod, haohongl}@andrew.cmu.edu, lbo@illinois.edu, dingzhao@cmu.edu
Pseudocode | Yes | Algorithm 1: GRADER Training
Open Source Code | Yes | Code is available on https://github.com/GilgameshD/GRADER.
Open Datasets | Yes | We design three new environments, which are shown in Figure 3 (excluding Chemistry [43]). These environments use the true state as observation to disentangle the reasoning task from visual understanding. ... Chemistry [43]: There are 10 nodes with different colors.
Dataset Splits | No | The paper describes training and testing settings (p_train(g) and p_test(g)) but does not specify numerical dataset splits for training, validation, or testing (e.g., 80/10/10 split).
Hardware Specification | Yes | We use a workstation equipped with NVIDIA RTX 3090 GPUs. ... Total compute: 10 RTX 3090 GPU days.
Software Dependencies | Yes | All experiments are implemented using PyTorch (version 1.10.1) and Python (version 3.8.8).
Experiment Setup | Yes | Appendix C.2.2 Hyperparameters: For all environments, we use the learning rate 0.0001, batch size 64, number of epochs 50, and use Adam optimizer for transition model learning. For policy learning, we use horizon H = 10, discount factor γ = 0.99, and the number of action samples for random shooting is 100.
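
To make the reported transition-model hyperparameters concrete, the sketch below shows one way the setup from Appendix C.2.2 could be wired up in PyTorch. This is a minimal sketch, not the authors' implementation: the `TransitionModel` architecture, the dataset format, and the MSE objective are assumptions, and the paper's coupling with causal graph discovery is omitted; only the Adam optimizer, learning rate 0.0001, batch size 64, and 50 epochs come from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

class TransitionModel(nn.Module):
    """Placeholder dynamics model: predicts the next state from (state, action).
    GRADER additionally discovers a causal graph during learning, which is omitted here."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_transition_model(model, dataset, epochs=50, batch_size=64, lr=1e-4):
    """Hyperparameters follow Appendix C.2.2: Adam, lr 0.0001, batch size 64, 50 epochs.
    `dataset` is assumed to yield (state, action, next_state) tensors."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action, next_state in loader:
            pred = model(state, action)
            loss = nn.functional.mse_loss(pred, next_state)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```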
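Similarly, the policy-learning numbers (horizon H = 10, discount γ = 0.99, 100 sampled action sequences) describe a random-shooting planner. The generic sketch below uses only those three numbers from the paper; the reward function, the candidate-action sampler, and the action-space format are placeholders, not details taken from the paper.

```python
import torch

def random_shooting(model, reward_fn, sample_actions, state,
                    horizon=10, num_samples=100, gamma=0.99):
    """Generic random-shooting planner; horizon 10, discount 0.99, and 100 candidate
    sequences match Appendix C.2.2. `model(state, action)` predicts the next state,
    `reward_fn(states, actions)` scores a batch of transitions, and `sample_actions(n, h)`
    draws candidate action sequences -- all three are task-specific assumptions here."""
    actions = sample_actions(num_samples, horizon)       # (num_samples, horizon, action_dim)
    states = state.unsqueeze(0).expand(num_samples, -1).clone()
    returns = torch.zeros(num_samples)
    with torch.no_grad():
        for t in range(horizon):
            step_actions = actions[:, t]
            returns += (gamma ** t) * reward_fn(states, step_actions)
            states = model(states, step_actions)
    best = torch.argmax(returns)
    return actions[best, 0]  # first action of the highest-return sequence

# Hypothetical candidate sampler for a continuous action space in [-1, 1]; the paper's
# environments may instead use discrete actions, in which case one would sample indices.
# sample_actions = lambda n, h: torch.empty(n, h, action_dim).uniform_(-1.0, 1.0)
```

In model-predictive-control fashion, the first action of the best sequence would typically be executed in the environment and the search repeated from the next state.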