Natural Temporal Difference Learning
Authors: William Dabney, Philip Thomas
AAAI 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with empirical comparisons on three canonical domains (mountain car, cartpole balancing, and acrobot) and one novel challenging domain (playing Tic-tac-toe using handwritten letters as input). |
| Researcher Affiliation | Academia | William Dabney and Philip S. Thomas, School of Computer Science, University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA 01003, {wdabney,pthomas}@cs.umass.edu |
| Pseudocode | Yes | Algorithm 1 Natural Residual Gradient; Algorithm 2 Natural Linear-Time Residual Gradient; Algorithm 3 Natural Sarsa(λ); Algorithm 4 Natural TDC (a hedged Sarsa(λ) sketch appears below the table) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We used an ϵ-greedy policy for all TD-learning algorithms. To evaluate the performance of natural TDC we generate experience from a fixed policy in the acrobot domain... For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. This allows us to show the benefits of natural gradients when the value function parameterization is non-linear and more complex. (See the Fourier-basis sketch below the table.) |
| Dataset Splits | No | No specific dataset split information for validation was provided. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We optimized the algorithm parameters for all experiments using a randomized search as suggested by Bergstra and Bengio (2012). We used an ϵ-greedy policy for all TD-learning algorithms. For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. (A hedged random-search sketch follows below.) |
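
The Pseudocode row names four natural TD algorithms but does not reproduce them. As a rough illustration of the family, here is a minimal Python sketch of Sarsa(λ) with linear function approximation and an ε-greedy policy (matching the Experiment Setup row), where the TD update is preconditioned by a running Sherman-Morrison estimate of the inverse metric G⁻¹ with G ≈ E[φφᵀ]. This follows the spirit of natural TD methods, not the paper's Algorithm 3 verbatim; the `env` interface, the metric step size `beta`, and all default parameters are assumptions.

```python
import numpy as np

def natural_sarsa_sketch(env, features, n_features, n_actions,
                         alpha=0.005, beta=0.01, gamma=1.0, lam=0.9,
                         epsilon=0.05, episodes=200, seed=0):
    """Hedged sketch: Sarsa(lambda) with linear function approximation,
    an epsilon-greedy policy, and a natural-gradient-style preconditioner.
    `env` is a hypothetical interface: reset() -> s, step(a) -> (s', r, done)."""
    rng = np.random.default_rng(seed)
    dim = n_features * n_actions
    theta = np.zeros(dim)      # value-function weights
    G_inv = np.eye(dim)        # running inverse-metric estimate

    def phi(s, a):
        # One-hot stacking of state features per action.
        x = np.zeros(dim)
        x[a * n_features:(a + 1) * n_features] = features(s)
        return x

    def act(s):
        # Epsilon-greedy over the approximate action values.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([theta @ phi(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = act(s)
        e = np.zeros(dim)                          # eligibility trace
        done = False
        while not done:
            s2, r, done = env.step(a)
            x = phi(s, a)
            e = gamma * lam * e + x
            # Sherman-Morrison update of G_inv for G <- (1-beta)G + beta*x*x^T.
            A_inv = G_inv / (1.0 - beta)
            Ax = A_inv @ x
            G_inv = A_inv - beta * np.outer(Ax, Ax) / (1.0 + beta * (x @ Ax))
            a2 = act(s2) if not done else 0
            q2 = 0.0 if done else theta @ phi(s2, a2)
            delta = r + gamma * q2 - theta @ x     # TD error
            theta += alpha * delta * (G_inv @ e)   # preconditioned update
            s, a = s2, a2
    return theta
```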
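
The third-order Fourier basis cited in the Open Datasets and Experiment Setup rows (Konidaris et al.) is a standard construction: scale the state to [0, 1]^d, then take cosines of π times every integer coefficient vector in {0, …, n}^d. A minimal sketch; the mountain car state bounds in the usage line are the conventional ones, not values quoted from the paper:

```python
import itertools
import numpy as np

def fourier_basis(order, low, high):
    """Order-n Fourier basis: returns a function mapping a raw state to
    (order+1)^d cosine features, with the state scaled to [0, 1]^d."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    coeffs = np.array(list(itertools.product(range(order + 1),
                                             repeat=len(low))))
    def features(state):
        s = (np.asarray(state, float) - low) / (high - low)
        return np.cos(np.pi * (coeffs @ s))
    return features

# Usage, e.g. for mountain car (position, velocity), 4^2 = 16 features:
phi = fourier_basis(3, low=[-1.2, -0.07], high=[0.6, 0.07])
```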
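
The Experiment Setup row reports a randomized hyperparameter search in the style of Bergstra and Bengio (2012): draw independent random parameter settings and keep the best-performing one. A minimal sketch; the sampled parameters, their ranges, and the `evaluate` callback are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """One random-search draw (illustrative ranges, not the paper's)."""
    return {
        "alpha": 10 ** rng.uniform(-4, 0),   # log-uniform step size
        "lam": rng.uniform(0.0, 1.0),        # eligibility-trace decay
        "epsilon": rng.uniform(0.0, 0.2),    # exploration rate
    }

def random_search(evaluate, n_trials=50):
    """Evaluate independent draws and return the best setting.
    `evaluate` is a hypothetical callback that runs the learner with the
    given parameters and returns its mean return."""
    trials = [sample_hyperparameters() for _ in range(n_trials)]
    scores = [evaluate(**t) for t in trials]
    return trials[int(np.argmax(scores))]
```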