An Efficient Approach to Model-Based Hierarchical Reinforcement Learning

Authors: Zhuoru Li, Akshay Narayan, Tze-Yun Leong

AAAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We test the framework on common benchmark problems and complex simulated robotic environments. It compares favorably against the state-of-the-art algorithms, and scales well in very large problems. Experiments: We test the empirical performance of CSRL on a set of benchmark experiments, formulated as a robot HRL agent solving different tasks. |
| Researcher Affiliation | Collaboration | School of Computing, National University of Singapore; School of Information Systems, Singapore Management University. lizhuoru@google.com, {anarayan, leongty}@comp.nus.edu.sg, leongty@smu.edu.sg. Currently affiliated with Google Korea, LLC. |
| Pseudocode | Yes | Algorithm 1: CSRL Algorithm; Algorithm 2: Construct SMDP(current_state); Algorithm 3: Simulate Task(s, i). (See the loop skeleton after this table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | This is a variant of the HRL benchmark Taxi problem (Dietterich 1998). We use the 10x10 grid world from Diuk et al. (Diuk, Cohen, and Littman 2008). (See the environment sketch after this table.) |
| Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits or percentages; it reports experiments in terms of episodes, as is standard in reinforcement learning. |
| Hardware Specification | Yes | The running time is the average of 10 independent runs, on a Xeon E5-2643 v2 3.50GHz using a single thread. |
| Software Dependencies | No | The paper mentions the Webots (Michel 2004) simulator but does not give version numbers for it or for any other software libraries or dependencies. |
| Experiment Setup | Yes | In all experiments, an episode terminates if it does not complete in 1000 steps. We set the exploration threshold m = 1 for all methods. Since R-MAXQ cannot converge with m = 1, we set m = 5 like other existing works (Jong and Stone 2008; Cao and Ray 2012). The reward for navigation actions and opening doors is -1. The reward for the task's unique actions is 40 if it completes the task, and -5 for attempting actions at wrong locations. (See the configuration sketch after this table.) |
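
The Open Datasets row points at benchmark environments rather than downloadable data: a variant of Dietterich's Taxi problem and the 10x10 grid world of Diuk et al. Reproducing the experiments therefore means re-implementing the environment. The sketch below is a minimal, generic Taxi-style grid world, assuming a 10x10 grid, corner landmarks, and the reward values quoted in the Experiment Setup row; it is not the paper's exact variant, and the class and method names are illustrative only.

```python
import random


class TaxiGridWorld:
    """A generic Taxi-style grid world (illustrative, not the paper's exact variant)."""

    ACTIONS = ["north", "south", "east", "west", "pickup", "dropoff"]

    def __init__(self, size=10, landmarks=((0, 0), (0, 9), (9, 0), (9, 9))):
        self.size = size
        self.landmarks = landmarks
        self.reset()

    def reset(self):
        # Random taxi position; passenger and destination at landmark cells.
        self.taxi = (random.randrange(self.size), random.randrange(self.size))
        self.passenger = random.choice(self.landmarks)
        self.destination = random.choice(self.landmarks)
        self.in_taxi = False
        return self._state()

    def _state(self):
        return (self.taxi, self.passenger, self.destination, self.in_taxi)

    def step(self, action):
        x, y = self.taxi
        if action == "north":
            self.taxi = (x, min(y + 1, self.size - 1))
        elif action == "south":
            self.taxi = (x, max(y - 1, 0))
        elif action == "east":
            self.taxi = (min(x + 1, self.size - 1), y)
        elif action == "west":
            self.taxi = (max(x - 1, 0), y)
        elif action == "pickup":
            if not self.in_taxi and self.taxi == self.passenger:
                self.in_taxi = True
            else:
                return self._state(), -5.0, False   # task action at a wrong location
        elif action == "dropoff":
            if self.in_taxi and self.taxi == self.destination:
                return self._state(), 40.0, True    # task completed
            return self._state(), -5.0, False       # task action at a wrong location
        return self._state(), -1.0, False           # default per-step (navigation) cost
```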
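
The Experiment Setup row quotes the key run parameters. The snippet below simply gathers those quoted values into one configuration object; the dataclass, field, and function names are illustrative choices, not identifiers from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Run parameters quoted in the Experiment Setup row (names are illustrative)."""
    max_steps_per_episode: int = 1000       # episode terminates if not completed in 1000 steps
    exploration_threshold_m: int = 1        # m = 1 for all methods ...
    exploration_threshold_m_rmaxq: int = 5  # ... except R-MAXQ, which uses m = 5
    reward_navigation: float = -1.0         # navigation actions and opening doors
    reward_task_success: float = 40.0       # task's unique action that completes the task
    reward_wrong_location: float = -5.0     # attempting the task action at a wrong location


def exploration_threshold(cfg: ExperimentConfig, method: str) -> int:
    """Pick the exploration threshold for a given method, per the quoted setup."""
    return cfg.exploration_threshold_m_rmaxq if method == "R-MAXQ" else cfg.exploration_threshold_m
```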
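
The Pseudocode row lists Algorithm 1 (CSRL Algorithm), Algorithm 2 (Construct SMDP(current_state)), and Algorithm 3 (Simulate Task(s, i)). The paper remains the authoritative description of those procedures; the skeleton below only sketches the control flow that the algorithm titles suggest for a model-based HRL agent. Every function body is a placeholder, and the loop structure, argument lists, and names (`planner`, `task_models`, and so on) are assumptions made for illustration.

```python
def csrl_episode(env, task_models, planner, max_steps=1000):
    """Hypothetical outer loop suggested by the algorithm titles; not the paper's Algorithm 1."""
    state = env.reset()
    for _ in range(max_steps):
        smdp = construct_smdp(state, task_models)        # cf. Algorithm 2: Construct SMDP(current_state)
        task = planner.choose_task(smdp, state)          # choose an abstract task to execute next
        state, experience, done = simulate_task(env, state, task)  # cf. Algorithm 3: Simulate Task(s, i)
        task_models.update(experience)                   # model-based update from observed transitions
        if done:
            break


def construct_smdp(state, task_models):
    """Placeholder: build an SMDP over tasks from the current state and the learned task models."""
    raise NotImplementedError


def simulate_task(env, state, task):
    """Placeholder: execute a task from `state`, returning the new state, experience, and done flag."""
    raise NotImplementedError
```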