Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Natural Temporal Difference Learning
Authors: William Dabney, Philip Thomas
AAAI 2014 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with empirical comparisons on three canonical domains (mountain car, cartpole balancing, and acrobot) and one novel challenging domain (playing Tic-tac-toe using handwritten letters as input). |
| Researcher Affiliation | Academia | William Dabney and Philip S. Thomas School of Computer Science University of Massachusetts Amherst 140 Governors Dr., Amherst, MA 01003 EMAIL |
| Pseudocode | Yes | Algorithm 1 Natural Residual Gradient; Algorithm 2 Natural Linear-Time Residual Gradient; Algorithm 3 Natural Sarsa(λ); Algorithm 4 Natural TDC |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We used an ϵ-greedy policy for all TD-learning algorithms. To evaluate the performance of natural TDC we generate experience from a fixed policy in the acrobot domain...For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. This allows us to show the benefits of natural gradients when the value function parameterization is non-linear and more complex. |
| Dataset Splits | No | No specific dataset split information for validation was provided. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We optimized the algorithm parameters for all experiments using a randomized search as suggested by Bergstra and Bengio (2012). We used an ϵ-greedy policy for all TD-learning algorithms. For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. |
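
The Experiment Setup row names two standard components: a third-order Fourier basis for linear function approximation and an ϵ-greedy policy. Since no source code is available for the paper, the following is only a minimal sketch of how these components are commonly written, not the authors' implementation. The function names `fourier_features` and `epsilon_greedy`, the state bounds, and the example values are hypothetical and chosen for illustration.

```python
import itertools
import math
import random


def fourier_features(state, lows, highs, order=3):
    """Return Fourier basis features cos(pi * c . x) for one state.

    `lows` and `highs` are assumed per-dimension state bounds (illustrative).
    """
    # Rescale every state dimension to [0, 1], as the basis assumes.
    x = [(s - lo) / (hi - lo) for s, lo, hi in zip(state, lows, highs)]
    # One feature per coefficient vector c in {0, ..., order}^d.
    coeffs = itertools.product(range(order + 1), repeat=len(x))
    return [
        math.cos(math.pi * sum(c_i * x_i for c_i, x_i in zip(c, x)))
        for c in coeffs
    ]


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical two-dimensional state with made-up bounds.
features = fourier_features([0.0, 0.0], lows=[-1.0, -1.0], highs=[1.0, 1.0])
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

For a d-dimensional state, a third-order basis of this form yields (3 + 1)^d features, so a two-dimensional state produces 16. The value of ϵ used above is a placeholder; the paper reports tuning its parameters by randomized search rather than fixing them by hand.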