Chaining Value Functions for Off-Policy Learning
Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt
AAAI 2022, pp. 8187-8195
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results. In this section we empirically study how the corresponding stochastic update for chained TD converges on a selection of MDPs and observe favourable results. |
| Researcher Affiliation | Collaboration | Simon Schmitt (1,2), John Shawe-Taylor (2), Hado van Hasselt (1); affiliations: (1) DeepMind, (2) University College London, UK; contact: suschmitt@google.com |
| Pseudocode | Yes | Algorithm 1 (sequential chained TD) is described in the paper; concurrent chained TD is obtained by moving line 2 to between lines 6 and 7. (A hedged sketch of the sequential scheme is given below the table.) |
| Open Source Code | No | No concrete access information (e.g., repository link or explicit statement of code release) for the methodology's source code was provided in the paper. |
| Open Datasets | Yes | Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results. Baird's MDP with and without rewards: Baird's MDP is a classic example that demonstrates the divergence of off-policy TD with linear function approximation and has been used to evaluate the convergence of novel approaches. Originally proposed with a discount of γ = 0.99, it is often used with γ = 0.9, which results in lower-variance updates. We consider both discounts. Furthermore we introduce a version of Baird's MDP with rewards, as the rewards of the classic MDP are all 0. We refer to this MDP as the Baird-Reward MDP. The Threestate MDP: Inspired by the Twostate MDP (Tsitsiklis and Van Roy 1997; Sutton, Mahmood, and White 2016), which demonstrates the divergence of off-policy TD concisely without rewards and with only two states, we propose the Threestate MDP with one middle state, two border states, and two actions: left with −1 reward and right with +1 reward, leading to the corresponding neighbouring state or remaining in place if there is no further state in that direction. (A minimal sketch of the Threestate dynamics is given below the table.) |
| Dataset Splits | No | The paper does not explicitly define training, validation, and test dataset splits with percentages or sample counts. It describes hyperparameter selection over 'the final 50% of transitions' but does not use the term 'validation set' or similar. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names and versions) were mentioned in the paper. |
| Experiment Setup | Yes | As hyper-parameters we consider all step-sizes α from the range S = {2^(−i/3) | i ∈ {1, ..., 40}} (i.e. logarithmically spaced between 9.6 × 10^−5 and 0.5); for GTD2 and TDC we also consider all secondary step-sizes β from the same range; for chained TD we consider chains of length 256 and evaluate the performance of only 9 indices k ∈ I = {2^i | i ∈ {0, ..., 8}}. For sequential chained TD we split the training into windows of T ∈ {25, 50, 100, 200} steps during which only one θ_k is estimated and all others are kept unchanged. To prevent pollution from accidentally good initial values, we initialize all parameters from a Gaussian distribution with σ = 100 such that errors at t = 0 are high. (The grid is summarised in a sketch below the table.) |
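
The pseudocode row refers to Algorithm 1 (sequential chained TD). The following is a minimal, hedged sketch of the chaining idea with linear features, assuming a list of behaviour-policy transitions and a feature map `phi`; the function name, environment interface, and the zero base case for the first link are illustrative assumptions, not the authors' exact Algorithm 1, and any importance-sampling corrections needed for the off-policy target are omitted. The concurrent variant would update every link at each step instead of one link per window.

```python
# Hedged sketch of sequential chained TD with linear value functions.
# Core idea: each link theta_k is regressed toward a bootstrap target that
# uses the PREVIOUS link theta_{k-1}, which is held fixed, so each link solves
# a well-behaved regression problem rather than an off-policy TD fixed point.
# Off-policy corrections (e.g. importance ratios for the target policy) are
# omitted; the sketch only illustrates the chaining structure.
import numpy as np

def sequential_chained_td(transitions, phi, num_links=256, window=100,
                          alpha=0.01, gamma=0.99, rng=None):
    """transitions: list of (s, r, s_next) from the behaviour policy.
    phi: feature map, state -> np.ndarray of shape (d,)."""
    rng = np.random.default_rng() if rng is None else rng
    d = phi(transitions[0][0]).shape[0]
    # High-variance init (sigma = 100 in the quoted setup) so early errors are large.
    thetas = rng.normal(0.0, 100.0, size=(num_links, d))
    k = 0  # index of the link currently being trained
    for t, (s, r, s_next) in enumerate(transitions):
        prev = thetas[k - 1] if k > 0 else np.zeros(d)   # assumed zero base case
        target = r + gamma * (phi(s_next) @ prev)        # bootstrap off previous link
        td_error = target - phi(s) @ thetas[k]
        thetas[k] += alpha * td_error * phi(s)           # gradient step on link k only
        if (t + 1) % window == 0:                        # next window trains the next link
            k = min(k + 1, num_links - 1)
    return thetas
```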
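The Open Datasets row mentions the Threestate MDP; the tiny step function below sketches its dynamics as quoted above, assuming the reconstructed reward signs (−1 for left, +1 for right), which should be checked against the paper.

```python
# Hedged sketch of the Threestate MDP as quoted in the Open Datasets row:
# states {0, 1, 2} (left border, middle, right border), actions "left"/"right".
# The reward signs are a reconstruction of the extracted text, not verified.
def threestate_step(state, action):
    """Return (next_state, reward) for the three-state chain."""
    if action == "left":
        next_state = max(state - 1, 0)   # stay if already at the left border
        reward = -1.0
    else:  # "right"
        next_state = min(state + 1, 2)   # stay if already at the right border
        reward = +1.0
    return next_state, reward
```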
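The experiment-setup cell can be summarised as a hyper-parameter grid. The reconstruction below assumes the step-size set has the form 2^(−i/3) (the quoted text gives endpoints 9.6 × 10^−5 and 0.5), so the exact exponents are an assumption.

```python
# Hedged summary of the quoted hyper-parameter grid; the 2**(-i/3) form is a
# reconstruction of the garbled set notation and should be checked against the paper.
step_sizes = [2 ** (-i / 3) for i in range(1, 41)]   # alpha grid, ~9.6e-5 up to ~0.79 (paper quotes 0.5)
secondary_step_sizes = list(step_sizes)              # beta grid for GTD2 / TDC
chain_length = 256                                   # number of links in chained TD
evaluated_indices = [2 ** i for i in range(9)]       # k in {1, 2, 4, ..., 256}
window_sizes = [25, 50, 100, 200]                    # T for sequential chained TD
init_sigma = 100.0                                   # std of the Gaussian parameter init
```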