I²HRL: Interactive Influence-based Hierarchical Reinforcement Learning
Authors: Rundong Wang, Runsheng Yu, Bo An, Zinovi Rabinovich
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate the effectiveness of the proposed solution in several tasks in MuJoCo domains by demonstrating that our approach can significantly boost the learning performance and accelerate learning compared with state-of-the-art HRL methods. |
| Researcher Affiliation | Academia | School of Computer Science and Engineering, Nanyang Technological University, Singapore |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | Yes | We evaluate and analyze our methods in the benchmarking hierarchical tasks [Duan et al., 2016]. These environments were all simulated using the MuJoCo physics engine for model-based control. The tasks are as follows: Ant Gather. Ant Maze. Ant Push. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Both levels of I2HRL utilize TD3. The low-level and high-level critic updates every single step and every 10 steps respectively. The low-level and high-level actor updates every 2 steps and every 20 steps respectively. We use the Adam optimizer with a learning rate of 3e-4 for the actor and critic of both levels of policies. We set the high-level policy decision interval k and the length of trajectories for the low-level policy representation c as 10. Discount γ = 0.99, replay buffer size is 200,000 for both levels of policies. The method-specific hyper-parameters (β and βr) are fine-tuned for each task. |
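
The reported setup can be summarized as a small configuration sketch. The snippet below is a minimal, hypothetical collection of the stated hyper-parameters in plain Python; the class and field names (`LevelConfig`, `HRLConfig`, etc.) are illustrative assumptions, not taken from the paper's (unreleased) code.

```python
# Hypothetical config sketch of the reported I2HRL hyper-parameters.
# Only the values quoted in the table above are grounded in the paper;
# all names and structure here are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class LevelConfig:
    critic_update_every: int        # env steps between critic updates
    actor_update_every: int         # env steps between actor updates
    lr: float = 3e-4                # Adam learning rate for actor and critic
    gamma: float = 0.99             # discount factor
    buffer_size: int = 200_000      # replay buffer capacity


@dataclass
class HRLConfig:
    low: LevelConfig
    high: LevelConfig
    k: int = 10                     # high-level decision interval
    c: int = 10                     # low-level trajectory length for the policy representation
    # beta and beta_r are method-specific and tuned per task (values not reported)


config = HRLConfig(
    low=LevelConfig(critic_update_every=1, actor_update_every=2),
    high=LevelConfig(critic_update_every=10, actor_update_every=20),
)
```

Since the paper releases no code, hardware specification, or software versions, such a config captures everything that is directly reproducible from the text; the per-task values of β and βr would still have to be re-tuned.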