Preferential Temporal Difference Learning
Authors: Nishanth Anand, Doina Precup
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we test Preferential TD on policy evaluation tasks in four different settings: tabular, linear, semi-linear (linear predictor with non-linear features), and non-linear (end-to-end training). Note that our theoretical results do not cover the last setup; however, it is quite easy to implement the algorithm in this setup. |
| Researcher Affiliation | Collaboration | 1Mila (Quebec Artificial Intelligence Institute), Montreal, Canada 2School of Computer Science, McGill University, Montreal, Canada 3DeepMind, Montreal, Canada. Correspondence to: Nishanth Anand <nishanth.anand@mail.mcgill.ca>. |
| Pseudocode | Yes | Algorithm 1 Preferential TD: Linear FA (a hedged code sketch of this update appears below the table) |
| Open Source Code | Yes | Code to reproduce the results can be found here. |
| Open Datasets | Yes | Task description: We consider the 19-state random walk problem from Sutton (1988) and Sutton & Barto (2018). |
| Dataset Splits | No | The paper does not explicitly specify training, validation, and test dataset splits with percentages or counts for reproducibility. It describes experimental tasks and settings, often involving interaction with an environment rather than pre-defined static datasets with explicit splits for training and validation. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for experiments. |
| Software Dependencies | No | The paper describes algorithms and experimental setups but does not provide specific version numbers for any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For ETD, we selected a constant interest of 0.01 on all states (selected from {0.01, 0.05, 0.1, 0.25} based on hyperparameter search) for each λ-α pair. We varied the corridor length ({5, 10, 15, 20, 25}) in our experiments. For each length and algorithm, we chose an optimal learning rate from 20 different values. We used γ = 1, a uniformly random policy, and the value function is predicted only at fully observable states. ... We experimented with {1, 2, 4, 8, 16} hidden units to check if PTD can estimate the values when the approximation capacity is limited. (An illustrative learning-rate sweep on the random-walk task is sketched after the table.) |
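
The "Pseudocode" row above points to Algorithm 1 (Preferential TD: Linear FA). The sketch below is a minimal illustration of the backward-view update the paper describes, in which a preference β(s) ∈ [0, 1] controls both how strongly a state is updated and how much later updates bootstrap from it; the `env` interface (`reset`, `sample_action`, `step`), the default hyperparameters, and the exact trace recursion shown here are assumptions for illustration, not a transcription of the authors' Algorithm 1.

```python
import numpy as np


def preferential_td_episode(env, w, beta, phi, alpha=0.1, gamma=1.0):
    """Run one episode of a Preferential TD-style update with linear features.

    A hedged sketch: beta(s) weights the accumulation of the eligibility trace
    (how much state s gets updated) and (1 - beta) its decay (how much the
    return continues past s instead of bootstrapping there).

    env   : hypothetical interface with reset() -> state,
            sample_action(state) -> action, step(action) -> (state, reward, done)
    w     : weight vector of shape (d,)
    beta  : callable, state -> preference in [0, 1]
    phi   : callable, state -> feature vector of shape (d,)
    """
    s = env.reset()
    e = np.zeros_like(w)                      # eligibility trace
    done = False
    while not done:
        a = env.sample_action(s)              # uniformly random policy in the paper's tasks
        s_next, r, done = env.step(a)
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        delta = r + gamma * v_next - v_s      # one-step TD error
        # Trace: accumulate beta(s) * phi(s); decay past credit by gamma * (1 - beta(s)).
        e = gamma * (1.0 - beta(s)) * e + beta(s) * phi(s)
        w = w + alpha * delta * e
        s = s_next
    return w
```

Setting β(s) = 1 everywhere recovers ordinary TD(0) updates, while β(s) = 0 at a state means that state is neither updated nor bootstrapped from, which is the behaviour the paper motivates for partially observable or otherwise untrusted states.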
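
The experiment-setup row mentions learning-rate searches on policy-evaluation tasks such as the 19-state random walk from the "Open Datasets" row. Below is a hedged usage example that runs the sketch above on that task with one-hot (tabular) features, a constant preference β = 0.5, and a small learning-rate sweep; the reward convention (-1/+1 at the terminals), the episode count, the seed, and the constant preference are assumptions standing in for the paper's full 20-value search.

```python
import numpy as np

# Assumes preferential_td_episode from the sketch above is in scope.


class RandomWalk19:
    """19-state random walk (Sutton, 1988; Sutton & Barto, 2018).

    Reward convention assumed here: -1 at the left terminal, +1 at the right
    terminal, 0 elsewhere; the episode starts in the centre state.
    """

    n_states = 19

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.n_states // 2
        return self.s

    def sample_action(self, s):
        return int(self.rng.choice((-1, 1)))   # uniformly random policy

    def step(self, a):
        self.s += a
        if self.s < 0:
            return self.s, -1.0, True
        if self.s >= self.n_states:
            return self.s, 1.0, True
        return self.s, 0.0, False


def one_hot(n):
    def phi(s):
        x = np.zeros(n)
        if 0 <= s < n:                          # terminal states map to the zero vector
            x[s] = 1.0
        return x
    return phi


if __name__ == "__main__":
    env = RandomWalk19()
    phi = one_hot(RandomWalk19.n_states)
    beta = lambda s: 0.5                        # constant preference; the paper also studies state-dependent choices
    true_v = np.arange(-9, 10) / 10.0           # closed-form values under the -1/+1 reward convention

    for alpha in (0.05, 0.1, 0.2):              # small sweep in place of the paper's 20-value search
        w = np.zeros(RandomWalk19.n_states)
        for _ in range(200):                    # episodes
            w = preferential_td_episode(env, w, beta, phi, alpha=alpha, gamma=1.0)
        rmse = np.sqrt(np.mean((w - true_v) ** 2))
        print(f"alpha={alpha:.2f}  RMSE={rmse:.3f}")
```

With tabular (one-hot) features, any constant β in (0, 1] keeps the true values as the fixed point, so the printed RMSE should shrink as episodes accumulate; this only illustrates the evaluation protocol, not the paper's reported numbers.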