Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning

Authors: Wenhao Ding, Haohong Lin, Bo Li, Ding Zhao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of the proposed method, we conduct comprehensive experiments in environments that require strong reasoning capability. Specifically, we design two types of generalization settings, i.e., spuriousness and composition, and provide an example to illustrate these settings in Figure 1. The evaluation results confirm the advantages of our method in two aspects. First, the proposed data-efficient discovery method provides an explainable causal graph yet requires much fewer data than previous methods, increasing data efficiency and interpretability during task solving. Second, simultaneously discovering the causal graph during policy learning dramatically increases the success rate of solving tasks.
Researcher Affiliation | Academia | 1 Carnegie Mellon University, 2 University of Illinois Urbana-Champaign; {wenhaod, haohongl}@andrew.cmu.edu, lbo@illinois.edu, dingzhao@cmu.edu
Pseudocode | Yes | Algorithm 1: GRADER Training
Open Source Code | Yes | Code is available on https://github.com/GilgameshD/GRADER.
Open Datasets | Yes | We design three new environments, which are shown in Figure 3 (excluding Chemistry [43]). These environments use the true state as observation to disentangle the reasoning task from visual understanding. ... Chemistry [43]: There are 10 nodes with different colors.
Dataset Splits | No | The paper describes training and testing settings (p_train(g) and p_test(g)) but does not specify numerical dataset splits for training, validation, or testing (e.g., 80/10/10 split).
Hardware Specification | Yes | We use a workstation equipped with NVIDIA RTX 3090 GPUs. ... Total compute: 10 RTX 3090 GPU days.
Software Dependencies | Yes | All experiments are implemented using PyTorch (version 1.10.1) and Python (version 3.8.8).
Experiment Setup | Yes | Appendix C.2.2 Hyperparameters: For all environments, we use the learning rate 0.0001, batch size 64, number of epochs 50, and use Adam optimizer for transition model learning. For policy learning, we use horizon H = 10, discount factor γ = 0.99, and the number of action samples for random shooting is 100.
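
To make the reported transition-model hyperparameters concrete, the sketch below shows one way the setup from Appendix C.2.2 could be wired up in PyTorch. This is a minimal sketch, not the authors' implementation: the `TransitionModel` architecture, the dataset format, and the MSE objective are assumptions, and the paper's coupling with causal graph discovery is omitted; only the Adam optimizer, learning rate 0.0001, batch size 64, and 50 epochs come from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

class TransitionModel(nn.Module):
    """Placeholder dynamics model: predicts the next state from (state, action).
    GRADER additionally discovers a causal graph during learning, which is omitted here."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_transition_model(model, dataset, epochs=50, batch_size=64, lr=1e-4):
    """Hyperparameters follow Appendix C.2.2: Adam, lr 0.0001, batch size 64, 50 epochs.
    `dataset` is assumed to yield (state, action, next_state) tensors."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action, next_state in loader:
            pred = model(state, action)
            loss = nn.functional.mse_loss(pred, next_state)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```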
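Similarly, the policy-learning numbers (horizon H = 10, discount γ = 0.99, 100 sampled action sequences) describe a random-shooting planner. The generic sketch below uses only those three numbers from the paper; the reward function, the candidate-action sampler, and the action-space format are placeholders, not details taken from the paper.

```python
import torch

def random_shooting(model, reward_fn, sample_actions, state,
                    horizon=10, num_samples=100, gamma=0.99):
    """Generic random-shooting planner; horizon 10, discount 0.99, and 100 candidate
    sequences match Appendix C.2.2. `model(state, action)` predicts the next state,
    `reward_fn(states, actions)` scores a batch of transitions, and `sample_actions(n, h)`
    draws candidate action sequences -- all three are task-specific assumptions here."""
    actions = sample_actions(num_samples, horizon)       # (num_samples, horizon, action_dim)
    states = state.unsqueeze(0).expand(num_samples, -1).clone()
    returns = torch.zeros(num_samples)
    with torch.no_grad():
        for t in range(horizon):
            step_actions = actions[:, t]
            returns += (gamma ** t) * reward_fn(states, step_actions)
            states = model(states, step_actions)
    best = torch.argmax(returns)
    return actions[best, 0]  # first action of the highest-return sequence

# Hypothetical candidate sampler for a continuous action space in [-1, 1]; the paper's
# environments may instead use discrete actions, in which case one would sample indices.
# sample_actions = lambda n, h: torch.empty(n, h, action_dim).uniform_(-1.0, 1.0)
```

In model-predictive-control fashion, the first action of the best sequence would typically be executed in the environment and the search repeated from the next state.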