Chaining Value Functions for Off-Policy Learning

Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results. In this section we empirically study how the corresponding stochastic update for chained TD converges on a selection of MDPs and observe favourable results.
Researcher Affiliation | Collaboration | Simon Schmitt (1,2), John Shawe-Taylor (2), Hado van Hasselt (1); 1: DeepMind, 2: University College London, UK; suschmitt@google.com
Pseudocode | Yes | Algorithm 1 (sequential chained TD) is given in the paper; concurrent chained TD is obtained by moving line 2 to between lines 6 and 7. An illustrative sketch of both variants appears after this table.
Open Source Code | No | No concrete access information (e.g., a repository link or an explicit statement of code release) for the method's source code was provided in the paper.
Open Datasets | Yes | Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results. Baird's MDP, with and without rewards: Baird's MDP is a classic example that demonstrates the divergence of off-policy TD with linear function approximation and has been used to evaluate the convergence of novel approaches. Originally proposed with a discount of γ = 0.99, it is often used with γ = 0.9, which results in lower-variance updates. We consider both discounts. Furthermore, we introduce a version of Baird's MDP with rewards, as the rewards of the classic MDP are all 0; we refer to this MDP as the Baird-Reward MDP. The Threestate MDP: Inspired by the Twostate MDP (Tsitsiklis and Van Roy 1997; Sutton, Mahmood, and White 2016), which demonstrates the divergence of off-policy TD concisely without rewards and with only two states, we propose the Threestate MDP with one middle state, two border states, and two actions: left with 1 reward and right with 1 reward, leading to the corresponding neighbouring state or remaining in place if there is no further state in that direction. A sketch of the standard Baird construction appears after this table.
Dataset Splits | No | The paper does not explicitly define training, validation, and test splits with percentages or sample counts. It describes hyper-parameter selection over 'the final 50% of transitions' but does not use the term 'validation set' or similar.
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments were mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names and versions) were mentioned in the paper.
Experiment Setup | Yes | As hyper-parameters we consider all step-sizes α from the range S = {2^(−i/3) | i ∈ {1, ..., 40}} (i.e. logarithmically spaced between 9.6·10^(−5) and 0.5); for GTD2 and TDC we also consider all secondary step-sizes β from the same range; for chained TD we consider chains of length 256 and evaluate the performance of only 9 indices k ∈ I = {2^i | i ∈ {0, ..., 8}}. For sequential chained TD we split the training into windows of T ∈ {25, 50, 100, 200} steps during which only one θ_k is estimated and all others are kept unchanged. To prevent pollution from accidentally good initial values, we initialize all parameters from a Gaussian distribution with σ = 100 so that errors at t = 0 are high. This grid is enumerated in a short snippet after this table.
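
The pseudocode row above can be made concrete with a minimal sketch of chained TD(0) with linear value functions. This is one plausible reading of the description quoted here, not the authors' Algorithm 1: it assumes each chain element θ_k bootstraps on a target computed from its predecessor θ_{k−1}, that θ_0 is anchored at zero, and that off-policy corrections use a per-transition importance ratio rho. The transition format (x, rho, r, x_next) and all names are illustrative assumptions.

```python
# Sketch of chained TD(0) with linear value functions (assumed reading, see above).
import numpy as np


def concurrent_chained_td(transitions, n_features, chain_length,
                          alpha, gamma, seed=0):
    """Concurrent variant: update every chain index on every transition."""
    rng = np.random.default_rng(seed)
    # High-variance init (sigma = 100), mirroring the experiment-setup row.
    theta = rng.normal(0.0, 100.0, size=(chain_length + 1, n_features))
    theta[0] = 0.0  # assumed fixed anchor that element 1 bootstraps on
    for x, rho, r, x_next in transitions:
        for k in range(1, chain_length + 1):
            # TD error of element k, bootstrapping on element k-1.
            td_error = r + gamma * theta[k - 1] @ x_next - theta[k] @ x
            theta[k] = theta[k] + alpha * rho * td_error * x
    return theta


def sequential_chained_td(transitions, n_features, chain_length,
                          alpha, gamma, window=50, seed=0):
    """Sequential variant: only one chain index is estimated per window."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 100.0, size=(chain_length + 1, n_features))
    theta[0] = 0.0
    for t, (x, rho, r, x_next) in enumerate(transitions):
        k = min(t // window + 1, chain_length)  # index currently being trained
        td_error = r + gamma * theta[k - 1] @ x_next - theta[k] @ x
        theta[k] = theta[k] + alpha * rho * td_error * x
    return theta
```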
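
The Baird MDP referenced in the open-datasets row has a standard textbook construction (Sutton and Barto: 7 states, 8-dimensional features, "dashed" and "solid" actions, all rewards zero, behaviour policy dashed with probability 6/7, target policy always solid). The generator below follows that textbook version and is only a stand-in for the paper's exact variant; it does not implement the Baird-Reward or Threestate MDPs.

```python
# Standard Baird's counterexample (textbook construction), written as a
# transition generator compatible with the chained-TD sketch above.
import numpy as np

N_STATES, N_FEATURES = 7, 8


def baird_features():
    """States 0-5: feature 2*e_s + e_7; state 6: e_6 + 2*e_7."""
    phi = np.zeros((N_STATES, N_FEATURES))
    for s in range(6):
        phi[s, s] = 2.0
        phi[s, 7] = 1.0
    phi[6, 6] = 1.0
    phi[6, 7] = 2.0
    return phi


def baird_transitions(n_steps, seed=0):
    """Yield (x, rho, r, x_next) under the dashed/solid behaviour policy."""
    rng = np.random.default_rng(seed)
    phi = baird_features()
    s = int(rng.integers(N_STATES))
    for _ in range(n_steps):
        if rng.random() < 6 / 7:                    # behaviour picks 'dashed'
            s_next, rho = int(rng.integers(6)), 0.0  # target policy never takes dashed
        else:                                        # behaviour picks 'solid'
            s_next, rho = 6, 7.0                     # rho = pi/mu = 1 / (1/7)
        yield phi[s], rho, 0.0, phi[s_next]          # all rewards are zero
        s = s_next
```

A hypothetical end-to-end call combining the two sketches would be `theta = sequential_chained_td(baird_transitions(10_000), N_FEATURES, 256, alpha=2**-10, gamma=0.99)`; whether this matches the authors' setup line-for-line is an assumption, not a claim.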
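
Finally, the hyper-parameter grid from the experiment-setup row is simple to enumerate. The snippet below spells out my reconstruction of the garbled ranges (step-sizes 2^(−i/3), chain indices 2^i, window lengths T); it should be checked against the original paper rather than taken as definitive.

```python
# Enumerate the hyper-parameter grid from the experiment-setup row
# (reconstructed notation; an interpretation, not a verbatim reproduction).
step_sizes = [2.0 ** (-i / 3) for i in range(1, 41)]   # 40 values of alpha
secondary_step_sizes = list(step_sizes)                # beta grid for GTD2 / TDC
chain_length = 256
evaluated_indices = [2 ** i for i in range(9)]         # k = 1, 2, 4, ..., 256
window_lengths = [25, 50, 100, 200]                    # T for sequential chained TD

print("evaluated chain indices:", evaluated_indices)
```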