Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Authors: Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, Daniel Graves

AAAI 2020, pp. 3741-3748 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5, Empirical Evaluation: "This section outlines several hypotheses concerning fixed-horizon TD methods, experiments aimed at testing them, and the results from each experiment. Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material." (A minimal sketch of the fixed-horizon update appears after the table.)
Researcher Affiliation | Collaboration | (1) University of Alberta, (2) University of Toronto, (3) Huawei Technologies Canada, Ltd.
Pseudocode | No | Pseudo-code is deferred to the supplementary material rather than given in the paper itself: "Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material."
Open Source Code | No | No explicit statement or link confirming the release of the paper's source code was found; the mention of "supplementary material" is not sufficient without a specific declaration of code availability.
Open Datasets | Yes | "In OpenAI Gym's Lunar Lander-v2 environment (Brockman et al. 2016), we compared Deep FHQ-learning (DFHQ) with a final horizon H = 64 and DQN (Mnih et al. 2015)."
Dataset Splits | No | The paper describes experimental runs and evaluations (e.g., mean return over the last 10 episodes), but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for the Lunar Lander-v2 environment.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper names algorithms and environments (e.g., RMSprop, OpenAI Gym's Lunar Lander-v2), but it does not specify software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | "We restricted the neural network to have two hidden layers, and swept over hidden layer widths for each algorithm. We used γ ∈ {0.99, 1.0}, and behaviour was ε-greedy with ε annealing linearly from 1.0 to 0.1 over 50,000 frames. RMSprop (Tieleman and Hinton 2012) was used on sampled mini-batches from an experience replay buffer (Mnih et al. 2015)." (See the ε-schedule sketch after the table.)
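
For context on the method quoted in the Research Type and Open Datasets rows, the following is a minimal tabular sketch of the one-step fixed-horizon Q-learning update the paper builds on: each horizon-h estimate bootstraps from the (h-1)-horizon estimate of the next state, with the zero-horizon values fixed at zero. The function name fhq_update and the array layout are our own illustrative choices, not the paper's code.

```python
import numpy as np

def fhq_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """One tabular fixed-horizon Q-learning (FHQ) update.

    Q has shape (H + 1, n_states, n_actions). Q[0] stays at zero, and
    Q[h] estimates the expected return over exactly h more steps,
    bootstrapping from the (h - 1)-horizon estimates of the next state.
    """
    H = Q.shape[0] - 1
    for h in range(1, H + 1):
        # Terminal transitions contribute no bootstrapped value.
        bootstrap = 0.0 if done else np.max(Q[h - 1, s_next])
        target = r + gamma * bootstrap
        Q[h, s, a] += alpha * (target - Q[h, s, a])
    return Q
```

Acting greedily with respect to the final horizon's values (H = 64 in the quoted DFHQ experiment) gives the control behaviour; the deep variant, DFHQ, presumably replaces this table with the two-hidden-layer network described in the Experiment Setup row.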
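
The Experiment Setup row quotes the ε-annealing schedule and discount sweep verbatim; below is a minimal sketch of that schedule plus a summary of the quoted settings, assuming frames are counted from zero. Only the values marked "stated" come from the excerpt; the hidden-layer widths (and the dict layout itself) are placeholders added for illustration.

```python
def linear_epsilon(frame, start=1.0, end=0.1, anneal_frames=50_000):
    """Epsilon for an epsilon-greedy policy, annealed linearly over frames."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

sweep = {
    "hidden_layers": 2,               # stated: two hidden layers
    "hidden_widths": [64, 128, 256],  # placeholder: widths were swept, values not given in the excerpt
    "gamma": [0.99, 1.0],             # stated sweep
    "final_horizon_H": 64,            # stated for DFHQ
    "optimizer": "RMSprop",           # stated (Tieleman and Hinton 2012)
    "experience_replay": True,        # stated: mini-batches sampled from a replay buffer
}
```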