Causality-driven Hierarchical Structure Discovery for Reinforcement Learning
Authors: Shaohui Peng, Xing Hu, Rui Zhang, Ke Tang, Jiaming Guo, Qi Yi, Ruizhi Chen, Xishan Zhang, Zidong Du, Ling Li, Qi Guo, Yunji Chen
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results in two complex environments, 2D-Minecraft and Eden, show that CDHRL significantly boosts exploration efficiency with the causality-driven paradigm. We verified our method in two typical complex tasks, including 2D-Minecraft [30] and the simplified sandbox survival game Eden [4]. The results show that CDHRL discovers high-quality hierarchical structures and significantly enhances exploration efficiency. |
| Researcher Affiliation | Collaboration | 1 SKL of Processors, Institute of Computing Technology, CAS; 2 University of Chinese Academy of Sciences; 3 Cambricon Technologies; 4 Department of Computer Science and Engineering, Southern University of Science and Technology; 5 SKL of Computer Science, Institute of Software, CAS; 6 University of Science and Technology of China |
| Pseudocode | Yes | (The pseudo-code of the framework is in Appendix B.1) |
| Open Source Code | Yes | We offer the code of our method in the supplemental material. |
| Open Datasets | Yes | We verified our method in two typical complex tasks, including 2D-Minecraft [30] and the simplified sandbox survival game Eden [4] |
| Dataset Splits | No | The paper mentions training stages (pre-training and adaptation) but does not provide specific percentages or counts for training, validation, or test dataset splits. |
| Hardware Specification | No | The paper explicitly states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]' |
| Software Dependencies | No | The paper mentions implementing the agent based on multi-level DQN with HER and that SCM is implemented like SDI, but it does not specify software versions for libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | The training is divided into two stages: pre-training and adaptation. In the pre-training stage, we train multi-level subgoal-based policies until no new causality is discovered. In the adaptation stage, we train an upper controller on the pre-trained subgoals to maximize the task reward. The SCM is implemented like SDI [16]. We set the initial subgoal distribution as the uniform distribution on G_EV ∪ G_S and pre-train the subgoals for a sufficient number of steps. To verify whether newly associated variables are controllable as supposed, we compare the final training success rates of the new subgoals with a preset threshold φ_causal before adding them to the subgoal hierarchy. (A minimal sketch of this two-stage loop appears after the table.) |
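
The Experiment Setup row describes a two-stage procedure: grow a subgoal hierarchy during pre-training until causal discovery finds no new controllable variables, gating each candidate subgoal on a success-rate threshold φ_causal, then train an upper controller over the discovered subgoals. The sketch below is a minimal illustration of that loop only; all names (`discover_new_causal_variables`, `train_subgoal_policy`, `PHI_CAUSAL`, the toy variable list) are hypothetical placeholders and the causal-discovery and policy-training steps are stubbed out, so this is not the authors' released implementation.

```python
# Hypothetical sketch of the CDHRL two-stage training described above.
# Placeholders stand in for SCM/SDI-style causal discovery and for the
# multi-level DQN + HER subgoal policies mentioned in the paper.

import random

PHI_CAUSAL = 0.9  # assumed success-rate threshold for accepting a new subgoal


def discover_new_causal_variables(hierarchy, env_variables):
    """Placeholder for SCM-based causal discovery (SDI-style in the paper):
    return environment variables newly associated with current subgoals."""
    return [v for v in env_variables if v not in hierarchy and random.random() < 0.5]


def train_subgoal_policy(variable, steps=10_000):
    """Placeholder for training a goal-conditioned policy (e.g. DQN + HER)
    to control `variable`; returns its final training success rate."""
    return random.random()


def pretrain_subgoal_hierarchy(env_variables, max_rounds=50):
    """Pre-training stage: grow the subgoal hierarchy until no new causality is found."""
    hierarchy = []
    for _ in range(max_rounds):
        candidates = discover_new_causal_variables(hierarchy, env_variables)
        if not candidates:
            break  # no new causality discovered -> pre-training ends
        for var in candidates:
            success_rate = train_subgoal_policy(var)
            # Verify the newly associated variable is actually controllable
            # before adding it to the subgoal hierarchy.
            if success_rate >= PHI_CAUSAL:
                hierarchy.append(var)
    return hierarchy


def adapt_upper_controller(hierarchy, episodes=100):
    """Adaptation stage: train an upper controller over the pre-trained subgoals
    to maximize the task reward (placeholder returns dummy preference scores)."""
    return {goal: random.random() for goal in hierarchy}


if __name__ == "__main__":
    toy_variables = ["wood", "plank", "stick", "stone_pickaxe"]  # illustrative only
    subgoals = pretrain_subgoal_hierarchy(toy_variables)
    controller = adapt_upper_controller(subgoals)
    print("Discovered subgoals:", subgoals)
```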