Natural Temporal Difference Learning

Authors: William Dabney, Philip Thomas

AAAI 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with empirical comparisons on three canonical domains (mountain car, cartpole balancing, and acrobot) and one novel challenging domain (playing Tic-tac-toe using handwritten letters as input).
Researcher Affiliation | Academia | William Dabney and Philip S. Thomas, School of Computer Science, University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA 01003, {wdabney,pthomas}@cs.umass.edu
Pseudocode | Yes | Algorithm 1 Natural Residual Gradient; Algorithm 2 Natural Linear-Time Residual Gradient; Algorithm 3 Natural Sarsa(λ); Algorithm 4 Natural TDC
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We used an ϵ-greedy policy for all TD-learning algorithms. To evaluate the performance of natural TDC we generate experience from a fixed policy in the acrobot domain...For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. This allows us to show the benefits of natural gradients when the value function parameterization is non-linear and more complex.
Dataset Splits | No | No specific dataset split information for validation was provided.
Hardware Specification | No | The paper does not provide specific hardware details used for running experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We optimized the algorithm parameters for all experiments using a randomized search as suggested by Bergstra and Bengio (2012). We used an ϵ-greedy policy for all TD-learning algorithms. For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes.
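
The experiment setup quoted above names three ingredients: a third-order Fourier basis for linear function approximation, an ϵ-greedy policy, and a randomized hyperparameter search. Since the paper provides no source code, the following is only a minimal sketch of how these pieces might look, not the authors' implementation. All function names are hypothetical, the state is assumed to be rescaled to [0, 1] in each dimension, and the search ranges in `sample_hyperparameters` are illustrative assumptions rather than values reported in the paper.

```python
import itertools
import math
import random


def fourier_features(state, order=3):
    """Fourier basis features cos(pi * c . s) for a state scaled to [0, 1]^d.

    One feature is produced per coefficient vector c in {0, ..., order}^d,
    giving (order + 1) ** d features in total.
    """
    coeffs = itertools.product(range(order + 1), repeat=len(state))
    return [
        math.cos(math.pi * sum(c_i * s_i for c_i, s_i in zip(c, state)))
        for c in coeffs
    ]


def q_value(weights, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))


def sample_hyperparameters(rng=random):
    """One random-search draw. The ranges here are assumptions for illustration."""
    return {
        "alpha": 10 ** rng.uniform(-4, 0),   # step size, sampled log-uniformly
        "epsilon": rng.uniform(0.0, 0.2),    # exploration rate
        "lambda": rng.uniform(0.0, 1.0),     # eligibility-trace decay
    }
```

For a two-dimensional state such as mountain car's (position, velocity), a third-order basis yields (3 + 1)^2 = 16 features per action. A random search would repeatedly call `sample_hyperparameters`, run the learner with each draw, and keep the best-performing configuration.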