Preferential Temporal Difference Learning
Authors: Nishanth Anand, Doina Precup
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we test Preferential TD on policy evaluation tasks in four different settings: tabular, linear, semi-linear (linear predictor with non-linear features), and non-linear (end-to-end training). Note that our theoretical results do not cover the last setup; however, it is quite easy to implement the algorithm in this setup. |
| Researcher Affiliation | Collaboration | 1Mila (Quebec Artificial Intelligence Institute), Montreal, Canada 2School of Computer Science, McGill University, Montreal, Canada 3DeepMind, Montreal, Canada. Correspondence to: Nishanth Anand <nishanth.anand@mail.mcgill.ca>. |
| Pseudocode | Yes | Algorithm 1 Preferential TD: Linear FA (a hedged code sketch of this update appears below the table) |
| Open Source Code | Yes | Code to reproduce the results can be found here. |
| Open Datasets | Yes | Task description: We consider the 19-state random walk problem from Sutton (1988) and Sutton & Barto (2018). |
| Dataset Splits | No | The paper does not explicitly specify training, validation, and test dataset splits with percentages or counts for reproducibility. It describes experimental tasks and settings, often involving interaction with an environment rather than pre-defined static datasets with explicit splits for training and validation. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for experiments. |
| Software Dependencies | No | The paper describes algorithms and experimental setups but does not provide specific version numbers for any software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For ETD, we selected a constant interest of 0.01 on all states (selected from {0.01, 0.05, 0.1, 0.25} based on hyperparameter search) for each λ-α pair. We varied the corridor length ({5, 10, 15, 20, 25}) in our experiments. For each length and algorithm, we chose an optimal learning rate from 20 different values. We used γ = 1, a uniformly random policy, and the value function is predicted only at fully observable states. ... We experimented with {1, 2, 4, 8, 16} hidden units to check if PTD can estimate the values when the approximation capacity is limited. (An illustrative learning-rate sweep on the random-walk task is sketched after the table.) |
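
The "Pseudocode" row above points to Algorithm 1 (Preferential TD: Linear FA). The sketch below is a minimal illustration of the backward-view update the paper describes, in which a preference β(s) ∈ [0, 1] controls both how strongly a state is updated and how much later updates bootstrap from it; the `env` interface (`reset`, `sample_action`, `step`), the default hyperparameters, and the exact trace recursion shown here are assumptions for illustration, not a transcription of the authors' Algorithm 1.

```python
import numpy as np


def preferential_td_episode(env, w, beta, phi, alpha=0.1, gamma=1.0):
    """Run one episode of a Preferential TD-style update with linear features.

    A hedged sketch: beta(s) weights the accumulation of the eligibility trace
    (how much state s gets updated) and (1 - beta) its decay (how much the
    return continues past s instead of bootstrapping there).

    env   : hypothetical interface with reset() -> state,
            sample_action(state) -> action, step(action) -> (state, reward, done)
    w     : weight vector of shape (d,)
    beta  : callable, state -> preference in [0, 1]
    phi   : callable, state -> feature vector of shape (d,)
    """
    s = env.reset()
    e = np.zeros_like(w)                      # eligibility trace
    done = False
    while not done:
        a = env.sample_action(s)              # uniformly random policy in the paper's tasks
        s_next, r, done = env.step(a)
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        delta = r + gamma * v_next - v_s      # one-step TD error
        # Trace: accumulate beta(s) * phi(s); decay past credit by gamma * (1 - beta(s)).
        e = gamma * (1.0 - beta(s)) * e + beta(s) * phi(s)
        w = w + alpha * delta * e
        s = s_next
    return w
```

Setting β(s) = 1 everywhere recovers ordinary TD(0) updates, while β(s) = 0 at a state means that state is neither updated nor bootstrapped from, which is the behaviour the paper motivates for partially observable or otherwise untrusted states.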
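
The experiment-setup row mentions learning-rate searches on policy-evaluation tasks such as the 19-state random walk from the "Open Datasets" row. Below is a hedged usage example that runs the sketch above on that task with one-hot (tabular) features, a constant preference β = 0.5, and a small learning-rate sweep; the reward convention (-1/+1 at the terminals), the episode count, the seed, and the constant preference are assumptions standing in for the paper's full 20-value search.

```python
import numpy as np

# Assumes preferential_td_episode from the sketch above is in scope.


class RandomWalk19:
    """19-state random walk (Sutton, 1988; Sutton & Barto, 2018).

    Reward convention assumed here: -1 at the left terminal, +1 at the right
    terminal, 0 elsewhere; the episode starts in the centre state.
    """

    n_states = 19

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.n_states // 2
        return self.s

    def sample_action(self, s):
        return int(self.rng.choice((-1, 1)))   # uniformly random policy

    def step(self, a):
        self.s += a
        if self.s < 0:
            return self.s, -1.0, True
        if self.s >= self.n_states:
            return self.s, 1.0, True
        return self.s, 0.0, False


def one_hot(n):
    def phi(s):
        x = np.zeros(n)
        if 0 <= s < n:                          # terminal states map to the zero vector
            x[s] = 1.0
        return x
    return phi


if __name__ == "__main__":
    env = RandomWalk19()
    phi = one_hot(RandomWalk19.n_states)
    beta = lambda s: 0.5                        # constant preference; the paper also studies state-dependent choices
    true_v = np.arange(-9, 10) / 10.0           # closed-form values under the -1/+1 reward convention

    for alpha in (0.05, 0.1, 0.2):              # small sweep in place of the paper's 20-value search
        w = np.zeros(RandomWalk19.n_states)
        for _ in range(200):                    # episodes
            w = preferential_td_episode(env, w, beta, phi, alpha=alpha, gamma=1.0)
        rmse = np.sqrt(np.mean((w - true_v) ** 2))
        print(f"alpha={alpha:.2f}  RMSE={rmse:.3f}")
```

With tabular (one-hot) features, any constant β in (0, 1] keeps the true values as the fixed point, so the printed RMSE should shrink as episodes accumulate; this only illustrates the evaluation protocol, not the paper's reported numbers.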