Statistical Efficiency of Distributional Temporal Difference Learning

Authors: Yang Peng, Liangyu Zhang, Zhihua Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional temporal difference learning (NTD). For a γ-discounted infinite-horizon tabular Markov decision process, we show that for NTD we need Õ(1/(ε^{2p}(1−γ)^{2p+1})) iterations to achieve an ε-optimal estimator with high probability, when the estimation error is measured by the p-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the 1-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the p-Wasserstein distance for p ≥ 1. (The bound is restated in display math after the table.)
Researcher Affiliation | Academia | School of Mathematical Sciences, Peking University (pengyang@pku.edu.cn); School of Statistics and Management, Shanghai University of Finance and Economics (zhangliangyu@sufe.edu.cn); School of Mathematical Sciences, Peking University (zhzhang@math.pku.edu.cn).
Pseudocode | No | The paper describes the updating schemes for NTD and CTD using mathematical formulas and text, but does not present them in a clearly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of a CTD-style update appears after the table.)
Open Source Code | No | The paper does not provide concrete access to source code. The NeurIPS checklist, section 5, explicitly states 'NA' for this question, indicating no code release.
Open Datasets | No | This paper is theoretical and does not conduct experiments with datasets; therefore, there is no mention of training data.
Dataset Splits | No | This paper is theoretical and does not conduct experiments with datasets; therefore, there is no mention of validation data.
Hardware Specification | No | This paper is theoretical and does not describe any experiments that would require specific hardware; therefore, no hardware specifications are mentioned.
Software Dependencies | No | This paper is theoretical and does not involve an experimental implementation that would require specific software dependencies with version numbers.
Experiment Setup | No | This paper is theoretical and does not include experimental setups, hyperparameters, or training configurations; its focus is the mathematical analysis of algorithms.
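
For reference, the sample-complexity claim quoted in the Research Type row can be written out in display math. This is only a transcription of the abstract's statement, not a new result:

\[
  \widetilde{O}\!\left( \frac{1}{\varepsilon^{2p}\,(1-\gamma)^{2p+1}} \right)
\]

iterations of NTD suffice, with high probability, to obtain an estimator that is ε-optimal in the p-Wasserstein distance W_p; the bound is minimax optimal up to logarithmic factors for p = 1, and the paper shows the same non-asymptotic bounds hold for CTD for all p ≥ 1.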
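
Since the paper gives the NTD and CTD updates only in formulas and prose, the following is a minimal sketch of what a tabular CTD-style update typically looks like. It uses the standard categorical projection from distributional RL (in the style of Bellemare et al., 2017), so it should be read as an illustration under that assumption rather than as the paper's exact scheme; all identifiers (ctd_update, probs, atoms, lr) are hypothetical.

import numpy as np

def ctd_update(probs, s, r, s_next, atoms, gamma, lr):
    # One hypothetical CTD-style update for state s after observing reward r
    # and next state s_next (tabular policy evaluation under a fixed policy).
    #   probs: (num_states, K) array of categorical return distributions
    #   atoms: evenly spaced support grid z_1 < ... < z_K (length K)
    K = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    # Distributional Bellman backup shifts and shrinks the support: z -> r + gamma z.
    tz = np.clip(r + gamma * atoms, v_min, v_max)
    # Fractional grid positions of the shifted atoms; the clip guards against
    # floating-point overshoot past the last index.
    b = np.clip((tz - v_min) / dz, 0.0, K - 1.0)
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    # Project the backed-up distribution onto the fixed grid by splitting each
    # shifted atom's mass between its two neighboring grid points.
    target = np.zeros(K)
    for k in range(K):
        if lo[k] == hi[k]:  # landed exactly on a grid point
            target[lo[k]] += probs[s_next, k]
        else:
            target[lo[k]] += probs[s_next, k] * (hi[k] - b[k])
            target[hi[k]] += probs[s_next, k] * (b[k] - lo[k])
    # Stochastic-approximation step toward the projected target.
    probs[s] = (1.0 - lr) * probs[s] + lr * target
    return probs

Iterating such an update over sampled transitions, with an appropriately decaying step size lr, is the stochastic-approximation template whose finite-sample behavior the paper analyzes.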