Preferential Temporal Difference Learning

Authors: Nishanth Anand, Doina Precup

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we test Preferential TD on policy evaluation tasks in four different settings: tabular, linear, semi-linear (linear predictor with non-linear features), and non-linear (end-to-end training). Note that our theoretical results do not cover the last setup; however, it is quite easy to implement the algorithm in this setup.
Researcher Affiliation | Collaboration | 1 Mila (Quebec Artificial Intelligence Institute), Montreal, Canada; 2 School of Computer Science, McGill University, Montreal, Canada; 3 DeepMind, Montreal, Canada. Correspondence to: Nishanth Anand <nishanth.anand@mail.mcgill.ca>.
Pseudocode | Yes | Algorithm 1 (Preferential TD: Linear FA); a sketch of this update is given below the table.
Open Source Code | Yes | Code to reproduce the results can be found here.
Open Datasets | Yes | Task description: We consider the 19-state random walk problem (Sutton, 1988; Sutton & Barto, 2018). An environment sketch is given below the table.
Dataset Splits | No | The paper does not specify training, validation, and test splits (by percentage or count). Its experiments involve interaction with an environment rather than pre-defined static datasets, so no such splits are reported.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for the experiments.
Software Dependencies | No | The paper describes the algorithms and experimental setups but does not give version numbers for any software dependencies (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | For ETD, we selected a constant interest of 0.01 on all states (selected from {0.01, 0.05, 0.1, 0.25} based on hyperparameter search) for each λ-α pair. We varied the corridor length ({5, 10, 15, 20, 25}) in our experiments. For each length and algorithm, we chose an optimal learning rate from 20 different values. We used γ = 1, a uniformly random policy, and the value function is predicted only at fully observable states. ... We experimented with {1, 2, 4, 8, 16} hidden units to check if PTD can estimate the values when the approximation capacity is limited. (A sweep-configuration sketch is given below the table.)
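
The Pseudocode row refers to Algorithm 1 (Preferential TD with linear function approximation). Below is a minimal Python sketch of a PTD-style linear update, assuming the forward-view return in which a preference β(s) in [0, 1] both scales the update at a state and controls how strongly the target bootstraps on that state's value; the names (`ptd_linear_episode`, `beta`, `phi`, `trace`) are illustrative, and the paper's Algorithm 1 remains the authoritative statement.

```python
import numpy as np

def ptd_linear_episode(transitions, beta, w, alpha=0.1, gamma=1.0):
    """Sketch of a Preferential TD update with linear features over one episode.

    transitions: iterable of (phi, reward, phi_next, done) tuples, where phi
        and phi_next are feature vectors of the current and next state.
    beta: callable mapping a feature vector to a preference in [0, 1];
        beta = 1 everywhere reduces this to ordinary one-step TD.
    w: weight vector of the linear value estimate v(s) = w . phi(s).
    """
    trace = np.zeros_like(w)
    for phi, reward, phi_next, done in transitions:
        b = beta(phi)  # preference of the current state

        # Accumulate credit for preferred states; the (1 - beta) factor in the
        # decay is what produces partial bootstrapping, letting later TD errors
        # keep flowing back through states with low preference.
        trace = gamma * (1.0 - b) * trace + b * phi

        v = w @ phi
        v_next = 0.0 if done else w @ phi_next
        delta = reward + gamma * v_next - v  # one-step TD error

        w = w + alpha * delta * trace
    return w
```

Setting beta to 1 everywhere recovers one-step TD, while setting it to 0 on partially observable corridor states and 1 on fully observable ones corresponds to the usage quoted in the Experiment Setup row, where values are predicted only at fully observable states.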
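
The Open Datasets row cites the 19-state random walk. The sketch below assumes the common Sutton & Barto (2018) convention: the agent starts in the centre state, moves left or right with equal probability, and terminates with reward -1 off the left end and +1 off the right end, with all other rewards 0; if the paper uses a different reward convention, only these constants change.

```python
import numpy as np

def random_walk_episode(n_states=19, rng=None):
    """Generate one episode of the n-state random walk policy-evaluation task.

    States are numbered 1..n_states; the agent starts in the middle and moves
    left or right with probability 0.5. Stepping off the left end terminates
    with reward -1, off the right end with reward +1; all other rewards are 0.
    Returns a list of (state, reward, next_state, done) transitions, with
    next_state set to None on termination.
    """
    rng = rng or np.random.default_rng()
    s = (n_states + 1) // 2          # start in the centre state
    transitions = []
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next == 0:              # absorbed on the left
            transitions.append((s, -1.0, None, True))
            return transitions
        if s_next == n_states + 1:   # absorbed on the right
            transitions.append((s, +1.0, None, True))
            return transitions
        transitions.append((s, 0.0, s_next, False))
        s = s_next

# Under the uniformly random policy the true values are linear in the state
# index (gambler's-ruin argument): v(i) = 2 * i / (n_states + 1) - 1.
true_values = np.array([2 * i / 20 - 1 for i in range(1, 20)])
```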
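
The Experiment Setup row quotes several hyperparameter grids (ETD interest, corridor lengths, learning rates, hidden units). The configuration below is a hypothetical sketch that mirrors those grids; the 20 candidate learning rates are not enumerated in the excerpt, so a log-spaced placeholder is used, and `run_fn` stands in for whatever training routine the released code provides.

```python
import numpy as np

# Hypothetical sweep configuration mirroring the grids quoted in the
# Experiment Setup row. The 20 candidate learning rates are not listed in
# the excerpt, so a log-spaced grid is used here purely as a placeholder.
SWEEP = {
    "corridor_length": [5, 10, 15, 20, 25],
    "etd_interest": [0.01, 0.05, 0.1, 0.25],   # ETD-only hyperparameter
    "learning_rate": np.logspace(-3, 0, 20),   # placeholder for "20 values"
    "hidden_units": [1, 2, 4, 8, 16],          # limited-capacity runs
}
GAMMA = 1.0  # undiscounted, as stated in the excerpt

def best_learning_rate(run_fn, fixed_cfg, lrs=SWEEP["learning_rate"]):
    """Pick the learning rate with the lowest error for one fixed setting.

    run_fn(cfg) is assumed to train a single configuration and return a
    scalar error (e.g. RMSE of the value estimates); it stands in for the
    training loop in the released code.
    """
    scores = [(run_fn({**fixed_cfg, "gamma": GAMMA, "learning_rate": lr}), lr)
              for lr in lrs]
    return min(scores, key=lambda s: s[0])[1]

# Usage sketch: tune the learning rate separately for each corridor length.
# best = {L: best_learning_rate(run_fn, {"corridor_length": L})
#         for L in SWEEP["corridor_length"]}
```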