The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
Authors: Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These insights lead to several testable hypotheses, which we use to conduct a further empirical study to better characterise domains in which QTD offers superior performance to TD, and vice versa, and find several common trends. |
| Researcher Affiliation | Industry | Mark Rowland¹, Yunhao Tang¹, Clare Lyle¹, Rémi Munos¹, Marc G. Bellemare², Will Dabney¹. ¹DeepMind, ²Google Research, Brain Team. |
| Pseudocode | Yes | Algorithm 1 QTD(m) for value estimation. Require: Initial quantile estimates ((θ(x, i))_{i=1}^{m} : x ∈ 𝒳), learning rate α, number of updates T. *(a hedged sketch of this update appears below the table)* |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | The structure of the MRPs is given by the Cartesian product of three levels of stochasticity in transition structure: Deterministic cycle structure; Sparse stochastic transition structure (sampled from a Garnet distribution; Archibald et al., 1995); Dense stochastic transition structure (sampled from Dirichlet(1, …, 1) distributions); together with three levels of stochasticity in reward structure: Deterministic rewards; Gaussian (variance 1) rewards; Exponentially distributed (rate 1) rewards. *(sketched below the table)* |
| Dataset Splits | No | The paper describes experiments conducted via 'online interaction with the environments' and does not mention traditional train/validation/test dataset splits. It evaluates mean-squared error after a certain number of updates based on this interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It only mentions that 'The experiments in this paper were undertaken using the Python 3 language'. |
| Software Dependencies | No | The paper states that experiments were undertaken using the 'Python 3 language, and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and Matplotlib (Hunter, 2007) libraries', but it does not specify version numbers for Python or any of these libraries. |
| Experiment Setup | Yes | Hyperparameters. In all experiments, we use a default discount factor of γ = 0.9. For both TD and QTD methods, all predictions are initialised to 0. Learning rates. For TD, 40 learning rates are swept over the range [5 × 10⁻⁴, 1], equally spaced in log-space. For QTD, 40 learning rates are swept over the range [5 × 10⁻³, 10], equally spaced in log-space. Each configuration was run 1,000 times... *(sweep reconstructed below the table)* |
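For concreteness, here is a minimal NumPy sketch of the QTD(m) update quoted in the Pseudocode row. Only the update rule itself (quantile-regression increments at the quantile midpoints τᵢ = (2i − 1)/(2m)) follows the paper's Algorithm 1; the array layout, function names, and the `value_estimate` helper are our assumptions.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, alpha, gamma):
    """One QTD(m) step on a sampled transition (x, r, x_next).

    theta: (num_states, m) array of quantile estimates theta(x, i).
    Sketch only: the array layout and names are our assumptions.
    """
    m = theta.shape[1]
    taus = (2.0 * np.arange(1, m + 1) - 1.0) / (2.0 * m)  # tau_i = (2i - 1) / (2m)
    targets = r + gamma * theta[x_next]                   # one sample target per quantile j
    # Fraction of targets falling strictly below each current estimate theta(x, i).
    below = (targets[None, :] < theta[x, :, None]).mean(axis=1)
    theta[x] += alpha * (taus - below)                    # quantile-regression increment
    return theta

def value_estimate(theta):
    """Value estimate as the mean of the m quantile estimates per state."""
    return theta.mean(axis=1)
```

Consistent with the Experiment Setup row, all predictions would be initialised to zero, e.g. `theta = np.zeros((num_states, m))`.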
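The Open Datasets row describes procedurally generated MRPs rather than a fixed dataset. The following hedged sketch shows how the three transition structures and three reward distributions named there could be sampled with NumPy; the Garnet branching factor and the reward mean are our placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition_matrix(n, kind, branching=3):
    """Sample an n-state transition matrix under one of the three
    structures named in the paper; 'branching' is our assumption."""
    if kind == "cycle":            # deterministic cycle: state s -> s + 1 (mod n)
        return np.roll(np.eye(n), 1, axis=1)
    if kind == "garnet":           # sparse: each state reaches a few successors
        P = np.zeros((n, n))
        for s in range(n):
            succ = rng.choice(n, size=branching, replace=False)
            P[s, succ] = rng.dirichlet(np.ones(branching))
        return P
    if kind == "dirichlet":        # dense: full Dirichlet(1, ..., 1) rows
        return rng.dirichlet(np.ones(n), size=n)
    raise ValueError(kind)

def sample_reward(kind, mean=1.0):
    """Reward draw for one transition; the mean is our placeholder."""
    if kind == "deterministic":
        return mean
    if kind == "gaussian":         # variance 1, per the quoted description
        return rng.normal(mean, 1.0)
    if kind == "exponential":      # rate 1, per the quoted description
        return rng.exponential(1.0)
    raise ValueError(kind)
```

The 3 × 3 Cartesian product of transition and reward kinds then yields the nine MRP families the quote describes.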
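Finally, the learning-rate sweeps in the Experiment Setup row map directly onto log-spaced grids. A short reconstruction, assuming NumPy:

```python
import numpy as np

# 40 learning rates per method, equally spaced in log-space (per the setup row).
td_lrs  = np.logspace(np.log10(5e-4), np.log10(1.0), num=40)   # TD:  [5e-4, 1]
qtd_lrs = np.logspace(np.log10(5e-3), np.log10(10.0), num=40)  # QTD: [5e-3, 10]
```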