Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Authors: Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, Yuxin Chen

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical Asynchronous Q-learning aims to learn the optimal action-value function (or Qfunction) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a γ-discounted MDP with state space S and action space A, we demonstrate that the ℓ -based sample complexity of classical asynchronous Q-learning namely, the number of samples needed to yield an entrywise ε-accurate estimate of the Q-function is at most on the order of 1 µmin(1 γ)5ε2 + tmix µmin(1 γ) up to some logarithmic factor, provided that a proper constant learning rate is adopted.
Researcher Affiliation Academia Gen Li Tsinghua Yuting Wei CMU Yuejie Chi CMU Yuantao Gu Tsinghua Yuxin Chen Princeton
Pseudocode Yes Algorithm 1: Asynchronous Q-learning; Algorithm 2: Asynchronous variance-reduced Q-learning; Algorithm 3: function Q = VR-Q-RUN-EPOCH
Open Source Code No The paper does not provide concrete access to source code for the methodology described. There are no links to repositories or explicit statements about code release.
Open Datasets No The paper is theoretical and does not use or reference any datasets for training or empirical evaluation.
Dataset Splits No The paper is theoretical and does not involve experimental validation or dataset splits.
Hardware Specification No The paper is theoretical and does not describe any experiments that would require hardware specifications.
Software Dependencies No The paper is theoretical and does not detail any software dependencies or specific version numbers required for reproducing empirical results.
Experiment Setup No The paper is theoretical and focuses on algorithm design and analysis, not on specific experimental setups with hyperparameters or training configurations for empirical runs.