An Efficient Approach to Model-Based Hierarchical Reinforcement Learning
Authors: Zhuoru Li, Akshay Narayan, Tze-Yun Leong
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the framework on common benchmark problems and complex simulated robotic environments. It compares favorably against the state-of-the-art algorithms, and scales well in very large problems. We test the empirical performance of CSRL on a set of benchmark experiments, formulated as a robot HRL agent solving different tasks. |
| Researcher Affiliation | Collaboration | School of Computing, National University of Singapore; School of Information Systems, Singapore Management University; lizhuoru@google.com, {anarayan, leongty}@comp.nus.edu.sg, leongty@smu.edu.sg; Currently affiliated with Google Korea, LLC |
| Pseudocode | Yes | Algorithm 1 CSRL Algorithm, Algorithm 2 Construct SMDP(current_state), Algorithm 3 Simulate Task(s, i) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | This is a variant of the HRL benchmark Taxi problem (Dietterich 1998). We use the 10x10 grid world from Diuk et al. (Diuk, Cohen, and Littman 2008). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits or percentages. It discusses experiments in terms of 'episodes' for reinforcement learning. |
| Hardware Specification | Yes | The running time is the average of 10 independent runs, on a Xeon E5-2643 v2 at 3.50 GHz using a single thread. |
| Software Dependencies | No | The paper mentions 'Webots (Michel 2004) simulator' but does not provide specific version numbers for it or any other software libraries or dependencies used. |
| Experiment Setup | Yes | In all experiments, an episode terminates if it does not complete in 1000 steps. We set the exploration threshold, m = 1 for all methods. Since R-MAXQ cannot converge with m = 1, we set m = 5 as in other existing works (Jong and Stone 2008; Cao and Ray 2012). The reward for navigation actions and opening doors is -1. The reward for each task's unique action is 40 if it completes the task, and -5 for attempting actions at wrong locations. |
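
The experiment setup quoted above can be summarized as a small configuration sketch. This is a minimal, hypothetical illustration only: the names (`ExperimentConfig`, `step_reward`, field names) are assumptions, and only the numeric values come from the paper's text; it is not the authors' code.

```python
from dataclasses import dataclass

# Minimal sketch of the quoted experiment setup.
# All identifiers are hypothetical; the values are taken from the paper's description.

@dataclass(frozen=True)
class ExperimentConfig:
    max_steps_per_episode: int = 1000      # episode terminates if not completed in 1000 steps
    exploration_threshold_m: int = 1        # m = 1 for all methods (m = 5 for R-MAXQ)
    reward_navigation: float = -1.0         # navigation actions and opening doors
    reward_task_completion: float = 40.0    # task's unique action at the correct location
    reward_wrong_location: float = -5.0     # attempting the unique action at a wrong location


def step_reward(action_is_unique: bool, at_correct_location: bool,
                cfg: ExperimentConfig = ExperimentConfig()) -> float:
    """Return the reward of one primitive action under the quoted reward scheme."""
    if not action_is_unique:
        return cfg.reward_navigation
    return cfg.reward_task_completion if at_correct_location else cfg.reward_wrong_location


if __name__ == "__main__":
    cfg = ExperimentConfig()
    print(cfg)
    print(step_reward(action_is_unique=True, at_correct_location=False))  # -5.0
```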