Gradient Temporal-Difference Learning with Regularized Corrections

Authors: Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that Gradient TD methods could be a better alternative to TD and Q-learning.
Researcher Affiliation | Collaboration | (1) Amii, Department of Computing Science, University of Alberta; (2) DeepMind Alberta.
Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code for all experiments is available at: https://github.com/rlai-lab/Regularized-GradientTD
Open Datasets | Yes | The first problem, Boyan's chain (Boyan, 2002), is a 13-state Markov chain where each state is represented by a compact feature representation. The second problem is Baird's (1995) well-known star counterexample. In Mountain Car (Moore, 1990; Sutton, 1996), the goal is to reach the top of a hill with an underpowered car. In Cart Pole (Barto, Sutton & Anderson, 1983), the goal is to keep a pole balanced as long as possible by moving a cart left or right.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or a description of how the data was partitioned) in the main text.
Hardware Specification | No | The paper mentions training with neural networks and references GPUs in a figure caption, but it does not provide specific details such as exact GPU/CPU models, memory, or other hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions optimizers like Adagrad and ADAM, and environments like MinAtar, but it does not specify any software dependencies with version numbers (e.g., Python version, or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | For all environments, we fix β = 1.0 for QRC and η = 1.0 for QC, and do not use target networks (for experiments with target networks, see Appendix F). In the two classic-control environments, we use 200 runs, an ϵ-greedy policy with ϵ = 0.1, and a discount of γ = 0.99. For the two MinAtar environments, Breakout and Space Invaders, we use 30 runs, γ = 0.99, and a decayed ϵ-greedy policy with ϵ = 1 decaying linearly to ϵ = 0.1 over the first 100,000 steps. All methods use a network with one convolutional layer, followed by a fully connected layer.
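
For concreteness, the control setup quoted in the last row can be collected into a single configuration sketch. The dictionary below is not taken from the authors' repository; the key names and structure are illustrative assumptions, and only the values are copied from the quoted setup.

```python
# Hypothetical configuration sketch assembled from the quoted experiment setup.
# Key names are illustrative assumptions, not the authors' actual config schema.
control_config = {
    "beta_qrc": 1.0,           # regularization parameter fixed for QRC
    "eta_qc": 1.0,             # secondary step-size ratio fixed for QC
    "target_network": False,   # no target networks (see Appendix F for that variant)
    "classic_control": {       # Mountain Car, Cart Pole
        "runs": 200,
        "epsilon": 0.1,        # fixed epsilon-greedy exploration
        "gamma": 0.99,
    },
    "minatar": {               # Breakout, Space Invaders
        "runs": 30,
        "gamma": 0.99,
        "epsilon_start": 1.0,  # decayed linearly to epsilon_end
        "epsilon_end": 0.1,
        "epsilon_decay_steps": 100_000,
    },
    "network": ["conv", "fully_connected"],  # one convolutional layer, then one fully connected layer
}
```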
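
Relatedly, since the Pseudocode row notes that the paper contains no labeled algorithm block, the core idea named in the title (a TDC-style update with an ℓ2-regularized correction vector) can be sketched in a few lines. This is a minimal on-policy, linear-prediction sketch under stated assumptions; it is not the paper's exact off-policy update (which uses importance-sampling ratios), and the function name and defaults below are hypothetical.

```python
import numpy as np

def tdrc_style_update(w, h, x, x_next, r, alpha=0.005, gamma=0.99, beta=1.0):
    """One on-policy, linear-feature step of a TDC-style update with a
    regularized correction vector (illustrative sketch, not the paper's code).

    w: primary weights, value estimate v(s) = w @ x
    h: secondary (correction) weights
    beta: L2 regularization strength on h (the 'regularized correction')
    """
    delta = r + gamma * (w @ x_next) - (w @ x)                    # TD error
    w = w + alpha * delta * x - alpha * gamma * (h @ x) * x_next  # TDC-style primary update
    h = h + alpha * (delta - (h @ x)) * x - alpha * beta * h      # correction update with L2 regularization
    return w, h
```

Setting beta = 0 recovers a TDC-style update with equal step sizes for w and h; the beta = 1.0 default mirrors the fixed β = 1.0 quoted in the Experiment Setup row for QRC.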