Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning
Authors: Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, Daniel Graves
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 5 (Empirical Evaluation): This section outlines several hypotheses concerning fixed-horizon TD methods, experiments aimed at testing them, and the results from each experiment. Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material. |
| Researcher Affiliation | Collaboration | University of Alberta; University of Toronto; Huawei Technologies Canada, Ltd. |
| Pseudocode | No | Pseudo-code, diagrams, more experimental details, and additional experiments can be found in the supplementary material. |
| Open Source Code | No | No explicit statement or link confirming the release of the paper's source code was found. The mention of 'supplementary material' is not sufficient without a specific declaration of code availability. |
| Open Datasets | Yes | In OpenAI Gym’s LunarLander-v2 environment (Brockman et al. 2016), we compared Deep FHQ-learning (DFHQ) with a final horizon H = 64 and DQN (Mnih et al. 2015). A hedged sketch of the fixed-horizon update appears below the table. |
| Dataset Splits | No | The paper describes experimental runs and evaluations (e.g., 'mean return over the last 10 episodes'), but it does not specify explicit train/validation/test dataset splits (e.g., percentages or counts of samples for each split) for the Lunar Lander-v2 environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions algorithms and environments (e.g., 'RMSprop', 'OpenAI Gym’s LunarLander-v2'), but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | We restricted the neural network to have two hidden layers, and swept over hidden layer widths for each algorithm. We used γ ∈ {0.99, 1.0}, and behaviour was ϵ-greedy with ϵ annealing linearly from 1.0 to 0.1 over 50,000 frames. RMSprop (Tieleman and Hinton 2012) was used on sampled mini-batches from an experience replay buffer (Mnih et al. 2015). A hedged sketch of this exploration schedule follows the table. |
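
The paper's central object is the fixed-horizon action-value estimate, where the horizon-h estimate bootstraps from the horizon-(h−1) estimate rather than from itself. Below is a minimal tabular sketch of one such FHQ-learning update, assuming the standard fixed-horizon bootstrap with Q_0 fixed at zero; `n_states`, `n_actions`, `alpha`, and the transition variables are hypothetical and chosen only for illustration (the paper's DFHQ experiment uses a neural network rather than a table).

```python
import numpy as np

H = 64          # final horizon, as in the paper's DFHQ experiment
gamma = 0.99    # one of the swept discount values
alpha = 0.1     # step size (illustrative; not reported in this excerpt)
n_states, n_actions = 16, 4

# Q[h] estimates the optimal h-step return; Q[0] stays identically zero,
# so the h = 1 target reduces to the immediate reward.
Q = np.zeros((H + 1, n_states, n_actions))

def fhq_update(s, a, r, s_next):
    """Apply one fixed-horizon Q-learning update for every horizon h = 1..H."""
    for h in range(1, H + 1):
        # Bootstrap from the next state's (h-1)-horizon estimate.
        target = r + gamma * Q[h - 1, s_next].max()
        Q[h, s, a] += alpha * (target - Q[h, s, a])
```

Because each Q_h bootstraps from a separate, already-converged-toward Q_{h−1}, the update avoids the self-referential bootstrapping that destabilizes ordinary TD under function approximation, which is the stability argument in the paper's title.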
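
The quoted setup pins down the exploration schedule exactly: ϵ anneals linearly from 1.0 to 0.1 over the first 50,000 frames. A minimal sketch, assuming frame counting starts at zero and ϵ is held at 0.1 thereafter:

```python
def epsilon(frame, start=1.0, end=0.1, anneal_frames=50_000):
    """Linearly anneal the epsilon-greedy exploration rate from `start`
    to `end` over the first `anneal_frames` frames, then hold at `end`."""
    frac = min(frame / anneal_frames, 1.0)
    return start + frac * (end - start)

assert epsilon(0) == 1.0
assert epsilon(25_000) == 0.55
assert epsilon(100_000) == 0.1
```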