Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning
Authors: Wenhao Ding, Haohong Lin, Bo Li, Ding Zhao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of the proposed method, we conduct comprehensive experiments in environments that require strong reasoning capability. Specifically, we design two types of generalization settings, i.e., spuriousness and composition, and provide an example to illustrate these settings in Figure 1. The evaluation results confirm the advantages of our method in two aspects. First, the proposed data-efficient discovery method provides an explainable causal graph yet requires much fewer data than previous methods, increasing data efficiency and interpretability during task solving. Second, simultaneously discovering the causal graph during policy learning dramatically increases the success rate of solving tasks. |
| Researcher Affiliation | Academia | ¹Carnegie Mellon University, ²University of Illinois Urbana-Champaign. Contact: {wenhaod, haohongl}@andrew.cmu.edu, lbo@illinois.edu, dingzhao@cmu.edu |
| Pseudocode | Yes | Algorithm 1: GRADER Training |
| Open Source Code | Yes | Code is available on https://github.com/GilgameshD/GRADER. |
| Open Datasets | Yes | We design three new environments, which are shown in Figure 3 (excluding Chemistry [43]). These environments use the true state as observation to disentangle the reasoning task from visual understanding. ... Chemistry [43]: There are 10 nodes with different colors. |
| Dataset Splits | No | The paper describes training and testing settings (p_train(g) and p_test(g)) but does not specify numerical dataset splits for training, validation, or testing (e.g., an 80/10/10 split). |
| Hardware Specification | Yes | We use a workstation equipped with NVIDIA RTX 3090 GPUs. ... Total compute: 10 RTX 3090 GPU days. |
| Software Dependencies | Yes | All experiments are implemented using PyTorch (version 1.10.1) and Python (version 3.8.8). |
| Experiment Setup | Yes | Appendix C.2.2 Hyperparameters: For all environments, we use the learning rate 0.0001, batch size 64, number of epochs 50, and use Adam optimizer for transition model learning. For policy learning, we use horizon H = 10, discount factor γ = 0.99, and the number of action samples for random shooting is 100. |
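The hyperparameters in the last row suggest a standard supervised training loop for the transition model. Below is a minimal sketch assuming a generic PyTorch setup; the `TransitionModel` architecture and the `train_transition_model` helper are hypothetical placeholders (GRADER's actual model factorizes the dynamics along the discovered causal graph), and only the optimizer choice and the reported values (learning rate 0.0001, batch size 64, 50 epochs) come from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical one-step transition model; the paper's model is structured
# by the discovered causal graph, which is omitted here for brevity.
class TransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_transition_model(states, actions, next_states,
                           lr=1e-4, batch_size=64, epochs=50):
    """Fit the model with the hyperparameters reported in Appendix C.2.2."""
    model = TransitionModel(states.shape[-1], actions.shape[-1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(states, actions, next_states),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for s, a, s_next in loader:
            loss = nn.functional.mse_loss(model(s, a), s_next)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```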
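Likewise, a hedged sketch of the random-shooting planner implied by the policy-learning settings (horizon H = 10, discount factor 0.99, 100 sampled action sequences). The continuous action sampling and the negative-distance goal reward are illustrative assumptions, not the paper's environment or reward definitions.

```python
import torch

def random_shooting_plan(model, state, goal, action_dim,
                         horizon=10, gamma=0.99, num_samples=100):
    """Return the first action of the best of `num_samples` random action
    sequences, scored by discounted negative distance to the goal state."""
    # Candidate action sequences: (num_samples, horizon, action_dim) in [-1, 1].
    candidates = torch.rand(num_samples, horizon, action_dim) * 2 - 1
    states = state.expand(num_samples, -1).clone()
    returns = torch.zeros(num_samples)
    with torch.no_grad():
        for t in range(horizon):
            states = model(states, candidates[:, t])
            # Illustrative goal-conditioned reward: negative L2 distance to goal.
            reward = -torch.norm(states - goal, dim=-1)
            returns += (gamma ** t) * reward
    best = returns.argmax()
    return candidates[best, 0]
```

Under these assumptions, the planner would be invoked at every environment step in a model-predictive-control loop, executing only the first action and re-planning from the next observed state.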