Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Authors: Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, Daniel Graves

AAAI 2020, pp. 3741-3748 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5, Empirical Evaluation: "This section outlines several hypotheses concerning fixed-horizon TD methods, experiments aimed at testing them, and the results from each experiment. Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material." (A minimal sketch of the fixed-horizon update appears after the table.)
Researcher Affiliation | Collaboration | (1) University of Alberta, (2) University of Toronto, (3) Huawei Technologies Canada, Ltd.
Pseudocode | No | Pseudo-code is deferred to the supplementary material rather than given in the paper itself: "Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material."
Open Source Code | No | No explicit statement or link confirming the release of the paper's source code was found; the mention of "supplementary material" is not sufficient without a specific declaration of code availability.
Open Datasets | Yes | "In OpenAI Gym's Lunar Lander-v2 environment (Brockman et al. 2016), we compared Deep FHQ-learning (DFHQ) with a final horizon H = 64 and DQN (Mnih et al. 2015)."
Dataset Splits | No | The paper describes experimental runs and evaluations (e.g., mean return over the last 10 episodes), but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for the Lunar Lander-v2 environment.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper names algorithms and environments (e.g., RMSprop, OpenAI Gym's Lunar Lander-v2), but it does not specify software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | "We restricted the neural network to have two hidden layers, and swept over hidden layer widths for each algorithm. We used γ ∈ {0.99, 1.0}, and behaviour was ε-greedy with ε annealing linearly from 1.0 to 0.1 over 50,000 frames. RMSprop (Tieleman and Hinton 2012) was used on sampled mini-batches from an experience replay buffer (Mnih et al. 2015)." (See the ε-schedule sketch after the table.)
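
For context on the method quoted in the Research Type and Open Datasets rows, the following is a minimal tabular sketch of the one-step fixed-horizon Q-learning update the paper builds on: each horizon-h estimate bootstraps from the (h-1)-horizon estimate of the next state, with the zero-horizon values fixed at zero. The function name fhq_update and the array layout are our own illustrative choices, not the paper's code.

```python
import numpy as np

def fhq_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """One tabular fixed-horizon Q-learning (FHQ) update.

    Q has shape (H + 1, n_states, n_actions). Q[0] stays at zero, and
    Q[h] estimates the expected return over exactly h more steps,
    bootstrapping from the (h - 1)-horizon estimates of the next state.
    """
    H = Q.shape[0] - 1
    for h in range(1, H + 1):
        # Terminal transitions contribute no bootstrapped value.
        bootstrap = 0.0 if done else np.max(Q[h - 1, s_next])
        target = r + gamma * bootstrap
        Q[h, s, a] += alpha * (target - Q[h, s, a])
    return Q
```

Acting greedily with respect to the final horizon's values (H = 64 in the quoted DFHQ experiment) gives the control behaviour; the deep variant, DFHQ, presumably replaces this table with the two-hidden-layer network described in the Experiment Setup row.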
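
The Experiment Setup row quotes the ε-annealing schedule and discount sweep verbatim; below is a minimal sketch of that schedule plus a summary of the quoted settings, assuming frames are counted from zero. Only the values marked "stated" come from the excerpt; the hidden-layer widths (and the dict layout itself) are placeholders added for illustration.

```python
def linear_epsilon(frame, start=1.0, end=0.1, anneal_frames=50_000):
    """Epsilon for an epsilon-greedy policy, annealed linearly over frames."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

sweep = {
    "hidden_layers": 2,               # stated: two hidden layers
    "hidden_widths": [64, 128, 256],  # placeholder: widths were swept, values not given in the excerpt
    "gamma": [0.99, 1.0],             # stated sweep
    "final_horizon_H": 64,            # stated for DFHQ
    "optimizer": "RMSprop",           # stated (Tieleman and Hinton 2012)
    "experience_replay": True,        # stated: mini-batches sampled from a replay buffer
}
```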