Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

Authors: Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, Josh Tenenbaum

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the strength of our approach on two problems with very sparse and delayed feedback: (1) a complex discrete stochastic decision process with stochastic transitions, and (2) the classic ATARI game 'Montezuma's Revenge'." (Section 4: Experiments)
Researcher Affiliation | Collaboration | Tejas D. Kulkarni (DeepMind, London, tejasdkulkarni@gmail.com); Karthik R. Narasimhan (CSAIL, MIT, karthikn@mit.edu); Ardavan Saeedi (CSAIL, MIT, ardavans@mit.edu); Joshua B. Tenenbaum (BCS, MIT, jbt@mit.edu)
Pseudocode | Yes | Algorithm 1 (Learning algorithm for h-DQN), Algorithm 2 (EPSGREEDY(x, B, ϵ, Q)), and Algorithm 3 (UPDATEPARAMS(L, D)); a sketch of the two-level loop appears after the table.
Open Source Code | No | A footnote ("Sample trajectory of a run on 'Montezuma's Revenge': https://goo.gl/3Z64Ji") links to a video, but the paper provides no explicit statement or link for open-source code of the described method.
Open Datasets | Yes | "We use the Arcade Learning Environment [3] to perform experiments." [3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012. (A loading sketch also follows the table.)
Dataset Splits | No | The paper specifies the sizes of the experience replay memories ("D1 and D2 were set to be equal to 10^6 and 5 × 10^4 respectively") but does not provide explicit training, validation, or test splits.
Hardware Specification | No | The paper does not report the hardware (e.g., CPU or GPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions the Deep Q-Learning framework and convolutional neural networks but does not list software dependencies with version numbers (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | "All ϵ parameters are annealed from 1 to 0.1 over 50k steps." "The learning rate is set to 2.5 × 10^-4." "The experience replay memories D1 and D2 were set to be equal to 10^6 and 5 × 10^4 respectively." "We set the learning rate to be 2.5 × 10^-4, with a discount rate of 0.99." (A configuration sketch follows the table.)