Gradient Temporal-Difference Learning with Regularized Corrections

Authors: Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that Gradient TD methods could be a better alternative to TD and Q-learning.
Researcher Affiliation | Collaboration | (1) Amii, Department of Computing Science, University of Alberta; (2) DeepMind Alberta.
Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code for all experiments is available at: https://github.com/rlai-lab/Regularized-GradientTD
Open Datasets | Yes | The first problem, Boyan's chain (Boyan, 2002), is a 13-state Markov chain where each state is represented by a compact feature representation. The second problem is Baird's (1995) well-known star counterexample. In Mountain Car (Moore, 1990; Sutton, 1996), the goal is to reach the top of a hill with an underpowered car. In Cart Pole (Barto, Sutton & Anderson, 1983), the goal is to keep a pole balanced as long as possible by moving a cart left or right.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or a description of how the data was partitioned) in the main text.
Hardware Specification | No | The paper mentions training with neural networks and references GPUs in a figure caption, but it does not provide specific details such as exact GPU/CPU models, memory, or other hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions optimizers like Adagrad and ADAM, and environments like MinAtar, but it does not specify any software dependencies with version numbers (e.g., Python version, or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | For all environments, we fix β = 1.0 for QRC and η = 1.0 for QC, and do not use target networks (for experiments with target networks, see Appendix F). In the two classic-control environments, we use 200 runs, an ϵ-greedy policy with ϵ = 0.1, and a discount of γ = 0.99. For the two MinAtar environments, Breakout and Space Invaders, we use 30 runs, γ = 0.99, and a decayed ϵ-greedy policy with ϵ = 1 decaying linearly to ϵ = 0.1 over the first 100,000 steps. All methods use a network with one convolutional layer, followed by a fully connected layer.
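
For concreteness, the control setup quoted in the last row can be collected into a single configuration sketch. The dictionary below is not taken from the authors' repository; the key names and structure are illustrative assumptions, and only the values are copied from the quoted setup.

```python
# Hypothetical configuration sketch assembled from the quoted experiment setup.
# Key names are illustrative assumptions, not the authors' actual config schema.
control_config = {
    "beta_qrc": 1.0,           # regularization parameter fixed for QRC
    "eta_qc": 1.0,             # secondary step-size ratio fixed for QC
    "target_network": False,   # no target networks (see Appendix F for that variant)
    "classic_control": {       # Mountain Car, Cart Pole
        "runs": 200,
        "epsilon": 0.1,        # fixed epsilon-greedy exploration
        "gamma": 0.99,
    },
    "minatar": {               # Breakout, Space Invaders
        "runs": 30,
        "gamma": 0.99,
        "epsilon_start": 1.0,  # decayed linearly to epsilon_end
        "epsilon_end": 0.1,
        "epsilon_decay_steps": 100_000,
    },
    "network": ["conv", "fully_connected"],  # one convolutional layer, then one fully connected layer
}
```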
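
Relatedly, since the Pseudocode row notes that the paper contains no labeled algorithm block, the core idea named in the title (a TDC-style update with an ℓ2-regularized correction vector) can be sketched in a few lines. This is a minimal on-policy, linear-prediction sketch under stated assumptions; it is not the paper's exact off-policy update (which uses importance-sampling ratios), and the function name and defaults below are hypothetical.

```python
import numpy as np

def tdrc_style_update(w, h, x, x_next, r, alpha=0.005, gamma=0.99, beta=1.0):
    """One on-policy, linear-feature step of a TDC-style update with a
    regularized correction vector (illustrative sketch, not the paper's code).

    w: primary weights, value estimate v(s) = w @ x
    h: secondary (correction) weights
    beta: L2 regularization strength on h (the 'regularized correction')
    """
    delta = r + gamma * (w @ x_next) - (w @ x)                    # TD error
    w = w + alpha * delta * x - alpha * gamma * (h @ x) * x_next  # TDC-style primary update
    h = h + alpha * (delta - (h @ x)) * x - alpha * beta * h      # correction update with L2 regularization
    return w, h
```

Setting beta = 0 recovers a TDC-style update with equal step sizes for w and h; the beta = 1.0 default mirrors the fixed β = 1.0 quoted in the Experiment Setup row for QRC.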