The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
Authors: Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These insights lead to several testable hypotheses, which we use to conduct a further empirical study to better characterise domains in which QTD offers superior performance to TD, and vice versa, and find several common trends. |
| Researcher Affiliation | Industry | Mark Rowland¹, Yunhao Tang¹, Clare Lyle¹, Rémi Munos¹, Marc G. Bellemare², Will Dabney¹. ¹DeepMind, ²Google Research, Brain Team. |
| Pseudocode | Yes | Algorithm 1 QTD(m) for value estimation. Require: Initial quantile estimates ((θ(x, i))_{i=1}^{m} : x ∈ 𝒳), learning rate α, number of updates T. *(a hedged sketch of this update appears below the table)* |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | The structure of the MRPs is given by the Cartesian product of three levels of stochasticity in transition structure: Deterministic cycle structure; Sparse stochastic transition structure (sampled from a Garnet distribution; Archibald et al., 1995); Dense stochastic transition structure (sampled from Dirichlet(1, …, 1) distributions); together with three levels of stochasticity in reward structure: Deterministic rewards; Gaussian (variance 1) rewards; Exponentially distributed (rate 1) rewards. *(sketched below the table)* |
| Dataset Splits | No | The paper describes experiments conducted via 'online interaction with the environments' and does not mention traditional train/validation/test dataset splits. It evaluates mean-squared error after a certain number of updates based on this interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It only mentions that 'The experiments in this paper were undertaken using the Python 3 language'. |
| Software Dependencies | No | The paper states that experiments were undertaken using the 'Python 3 language, and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and Matplotlib (Hunter, 2007) libraries', but it does not specify version numbers for Python or any of these libraries. |
| Experiment Setup | Yes | Hyperparameters. In all experiments, we use a default discount factor of γ = 0.9. For both TD and QTD methods, all predictions are initialised to 0. Learning rates. For TD, 40 learning rates are swept over the range [5 × 10⁻⁴, 1], equally spaced in log-space. For QTD, 40 learning rates are swept over the range [5 × 10⁻³, 10], equally spaced in log-space. Each configuration was run 1,000 times... *(sweep reconstructed below the table)* |
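For concreteness, here is a minimal NumPy sketch of the QTD(m) update quoted in the Pseudocode row. Only the update rule itself (quantile-regression increments at the quantile midpoints τᵢ = (2i − 1)/(2m)) follows the paper's Algorithm 1; the array layout, function names, and the `value_estimate` helper are our assumptions.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, alpha, gamma):
    """One QTD(m) step on a sampled transition (x, r, x_next).

    theta: (num_states, m) array of quantile estimates theta(x, i).
    Sketch only: the array layout and names are our assumptions.
    """
    m = theta.shape[1]
    taus = (2.0 * np.arange(1, m + 1) - 1.0) / (2.0 * m)  # tau_i = (2i - 1) / (2m)
    targets = r + gamma * theta[x_next]                   # one sample target per quantile j
    # Fraction of targets falling strictly below each current estimate theta(x, i).
    below = (targets[None, :] < theta[x, :, None]).mean(axis=1)
    theta[x] += alpha * (taus - below)                    # quantile-regression increment
    return theta

def value_estimate(theta):
    """Value estimate as the mean of the m quantile estimates per state."""
    return theta.mean(axis=1)
```

Consistent with the Experiment Setup row, all predictions would be initialised to zero, e.g. `theta = np.zeros((num_states, m))`.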
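The Open Datasets row describes procedurally generated MRPs rather than a fixed dataset. The following hedged sketch shows how the three transition structures and three reward distributions named there could be sampled with NumPy; the Garnet branching factor and the reward mean are our placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition_matrix(n, kind, branching=3):
    """Sample an n-state transition matrix under one of the three
    structures named in the paper; 'branching' is our assumption."""
    if kind == "cycle":            # deterministic cycle: state s -> s + 1 (mod n)
        return np.roll(np.eye(n), 1, axis=1)
    if kind == "garnet":           # sparse: each state reaches a few successors
        P = np.zeros((n, n))
        for s in range(n):
            succ = rng.choice(n, size=branching, replace=False)
            P[s, succ] = rng.dirichlet(np.ones(branching))
        return P
    if kind == "dirichlet":        # dense: full Dirichlet(1, ..., 1) rows
        return rng.dirichlet(np.ones(n), size=n)
    raise ValueError(kind)

def sample_reward(kind, mean=1.0):
    """Reward draw for one transition; the mean is our placeholder."""
    if kind == "deterministic":
        return mean
    if kind == "gaussian":         # variance 1, per the quoted description
        return rng.normal(mean, 1.0)
    if kind == "exponential":      # rate 1, per the quoted description
        return rng.exponential(1.0)
    raise ValueError(kind)
```

The 3 × 3 Cartesian product of transition and reward kinds then yields the nine MRP families the quote describes.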
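Finally, the learning-rate sweeps in the Experiment Setup row map directly onto log-spaced grids. A short reconstruction, assuming NumPy:

```python
import numpy as np

# 40 learning rates per method, equally spaced in log-space (per the setup row).
td_lrs  = np.logspace(np.log10(5e-4), np.log10(1.0), num=40)   # TD:  [5e-4, 1]
qtd_lrs = np.logspace(np.log10(5e-3), np.log10(10.0), num=40)  # QTD: [5e-3, 10]
```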