Gradient Temporal-Difference Learning with Regularized Corrections
Authors: Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that Gradient TD methods could be a better alternative to TD and Q-learning. |
| Researcher Affiliation | Collaboration | 1Amii, Department of Computing Science, University of Alberta. 2DeepMind Alberta. |
| Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code for all experiments is available at: https://github.com/rlai-lab/Regularized-GradientTD |
| Open Datasets | Yes | The first problem, Boyan's chain (Boyan, 2002), is a 13 state Markov chain where each state is represented by a compact feature representation. The second problem is Baird's (1995) well-known star counterexample. In Mountain Car (Moore, 1990; Sutton, 1996), the goal is to reach the top of a hill, with an underpowered car. In Cart Pole (Barto, Sutton & Anderson, 1983), the goal is to keep a pole balanced as long as possible, by moving a cart left or right. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or explicit references to how the data was partitioned for training, validation, and testing) within the main text. |
| Hardware Specification | No | The paper mentions training with neural networks, but it does not provide specific details such as exact GPU/CPU models, memory, or other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adagrad and ADAM, and environments like MinAtar, but it does not specify any software dependencies with version numbers (e.g., Python version, library versions like PyTorch or TensorFlow). |
| Experiment Setup | Yes | For all environments, we fix β = 1.0 for QRC, η = 1.0 for QC and do not use target networks (for experiments with target networks see Appendix F). In the two classic control environments, we use 200 runs, an ϵ-greedy policy with ϵ = 0.1 and a discount of γ = 0.99. For the two MinAtar environments, Breakout and Space Invaders, we use 30 runs, γ = 0.99 and a decayed ϵ-greedy policy with ϵ = 1 decaying linearly to ϵ = 0.1 over the first 100,000 steps. All methods use a network with one convolutional layer, followed by a fully connected layer. |
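The benchmark problems named in the "Open Datasets" row are standard reinforcement-learning environments rather than fixed datasets. As a rough illustration of their availability, the two classic control tasks can be instantiated from the `gymnasium` package; this is an assumption about tooling, not the authors' setup, and Boyan's chain and Baird's counterexample are small custom MDPs that would need to be implemented separately.

```python
# Illustrative sketch only: instantiating the two classic control benchmarks
# via gymnasium. The paper's experiments may use different implementations.
import gymnasium as gym

mountain_car = gym.make("MountainCar-v0")  # underpowered car must reach the hilltop
cart_pole = gym.make("CartPole-v1")        # balance a pole by moving the cart left/right

obs, info = mountain_car.reset(seed=0)
print(mountain_car.observation_space, mountain_car.action_space)
```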
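The "Experiment Setup" row pins down most of the control hyperparameters. Below is a minimal sketch of how those settings could be collected in code; the `Config` dataclass and `epsilon_at` helper are hypothetical names, not taken from the authors' repository, while the values come from the row above.

```python
# Hedged sketch: names are illustrative, values are those reported in the paper.
from dataclasses import dataclass

@dataclass
class Config:
    beta: float = 1.0                 # fixed for QRC in all environments
    eta: float = 1.0                  # fixed for QC in all environments
    gamma: float = 0.99               # discount in all four control environments
    use_target_network: bool = False  # target-network runs are deferred to Appendix F
    runs_classic_control: int = 200   # Mountain Car and Cart Pole
    runs_minatar: int = 30            # Breakout and Space Invaders
    epsilon_classic_control: float = 0.1  # fixed ϵ-greedy exploration

def epsilon_at(step: int, decay_steps: int = 100_000,
               start: float = 1.0, end: float = 0.1) -> float:
    """Linear decay used for the MinAtar runs: ϵ goes from 1.0 to 0.1 over the
    first 100,000 steps and stays at 0.1 afterwards."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

# e.g. epsilon_at(0) == 1.0, epsilon_at(50_000) == 0.55, epsilon_at(200_000) == 0.1
```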