TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

Authors: Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, Shimon Whiteson

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al., 2017) on multiple Atari games. Furthermore, we present ablation studies that demonstrate the effect of different auxiliary losses on learning transition models.
Researcher Affiliation | Academia | Gregory Farquhar (gregory.farquhar@cs.ox.ac.uk), Tim Rocktäschel (tim.rocktaschel@cs.ox.ac.uk), Maximilian Igl (maximilian.igl@cs.ox.ac.uk), Shimon Whiteson (shimon.whiteson@cs.ox.ac.uk); University of Oxford, United Kingdom
Pseudocode | No | The paper describes the algorithms in prose and mathematical formulations but does not present any structured pseudocode or algorithm blocks. (A hedged sketch of the tree expansion and backup is given after this table.)
Open Source Code | No | Our implementations are based on OpenAI Baselines (Hesse et al., 2017).
Open Datasets | Yes | Atari. To demonstrate the general applicability of TreeQN and ATreeC to complex environments, we evaluate them on the Atari 2600 suite (Bellemare et al., 2013).
Dataset Splits | No | The paper evaluates on the Atari 2600 suite and a custom box-pushing environment but does not specify explicit train/validation/test splits. Box-pushing levels are randomly generated for each episode, and no splits are mentioned for Atari.
Hardware Specification | Yes | The NVIDIA DGX-1 used for this research was donated by the NVIDIA Corporation.
Software Dependencies | No | The paper mentions that 'Our implementations are based on OpenAI Baselines (Hesse et al., 2017)' and that 'All experiments use RMSProp', but it does not specify version numbers for any software libraries or frameworks.
Experiment Setup | Yes | Full details of the experimental setup, as well as architecture and training hyperparameters, are given in the appendix. For instance, 'All experiments use RMSProp (Tieleman & Hinton, 2012) with a learning rate of 1e-4, a decay of α = 0.99, and ϵ = 1e-5' and 'We use n_steps = 5 and n_envs = 16, for a total batch size of 80.' (An optimizer configuration sketch follows this table.)
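
Since the paper presents the algorithm only in prose and equations, the following is a minimal sketch of the TreeQN idea as we read it: a learned encoder, per-action transition and reward functions, and a recursive backup that mixes each node's value estimate with the best child Q-value. Everything here is an assumption for illustration: the module and parameter names (TreeQNSketch, encoder, transition, reward_fn, value_fn, lam) are ours, the architecture is heavily simplified, and the sketch uses PyTorch even though the authors' implementation builds on OpenAI Baselines.

```python
import torch
import torch.nn as nn

class TreeQNSketch(nn.Module):
    """Hedged sketch of a TreeQN-style differentiable tree.

    All names and architecture choices are illustrative stand-ins,
    not the authors' implementation.
    """

    def __init__(self, obs_dim, n_actions, latent_dim=64, depth=2,
                 gamma=0.99, lam=0.8):
        super().__init__()
        self.n_actions = n_actions
        self.depth = depth        # number of transitions to expand (>= 1)
        self.gamma = gamma
        self.lam = lam            # mixes node value with best child Q
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        # One learned transition per action keeps the sketch simple.
        self.transition = nn.ModuleList(
            [nn.Linear(latent_dim, latent_dim) for _ in range(n_actions)])
        self.reward_fn = nn.Linear(latent_dim, n_actions)  # r(z, a)
        self.value_fn = nn.Linear(latent_dim, 1)           # V(z)

    def q_values(self, z, depth):
        """Expand the tree below latent state z and back up Q-values."""
        rewards = self.reward_fn(z)                        # (batch, n_actions)
        q_per_action = []
        for a in range(self.n_actions):
            z_next = torch.relu(self.transition[a](z))
            if depth == 1:
                backup = self.value_fn(z_next).squeeze(-1)  # leaf: V(z')
            else:
                v = self.value_fn(z_next).squeeze(-1)
                q_child = self.q_values(z_next, depth - 1)
                # Mix the node's own value with the best child Q-value.
                backup = (1 - self.lam) * v + self.lam * q_child.max(-1).values
            q_per_action.append(rewards[..., a] + self.gamma * backup)
        return torch.stack(q_per_action, dim=-1)

    def forward(self, obs):
        return self.q_values(self.encoder(obs), self.depth)
```

A forward pass on a batch of observations returns tree-backed-up Q-values: for example, `TreeQNSketch(obs_dim=128, n_actions=4)(torch.randn(8, 128))` yields a (8, 4) tensor of Q-values, one per action, fully differentiable through the tree.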
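The quoted training hyperparameters map naturally onto a standard RMSProp configuration. The snippet below is a hedged illustration: only the numeric values (learning rate, decay, epsilon, n_steps, n_envs) come from the paper's appendix; the PyTorch optimizer call and the TreeQNSketch instantiation are our own stand-ins, not the authors' setup.

```python
import torch

# Hyperparameter values quoted from the paper's appendix; the PyTorch
# mapping is our assumption (the authors built on OpenAI Baselines).
N_STEPS = 5                     # n-step return length
N_ENVS = 16                     # parallel environments
BATCH_SIZE = N_STEPS * N_ENVS   # total batch size of 80

# Hypothetical instantiation; obs_dim and n_actions are placeholders.
model = TreeQNSketch(obs_dim=128, n_actions=4)

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-4,      # learning rate
    alpha=0.99,   # RMSProp decay
    eps=1e-5,
)
```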