Q-learning with Nearest Neighbors

Authors: Devavrat Shah, Qiaomin Xie

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and the discount factor γ ∈ (0, 1), given an arbitrary sample path with covering time L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε³(1 − γ)⁷)) samples. ... Our analysis consists of viewing our algorithm as a special case of a general biased stochastic approximation procedure, for which we establish non-asymptotic convergence guarantees.
Researcher Affiliation | Academia | Devavrat Shah, Massachusetts Institute of Technology, devavrat@mit.edu; Qiaomin Xie, Massachusetts Institute of Technology, qxie@mit.edu. Both authors are affiliated with the Laboratory for Information and Decision Systems (LIDS). DS is with the Department of EECS as well as the Statistics and Data Science Center at MIT.
Pseudocode | Yes | The paper provides Policy 1, Nearest-Neighbor Q-learning:

    Input: exploration policy π, discount factor γ, number of steps T, bandwidth parameter h, initial state Y_0.
    Construct discretized state space X_h; initialize t = k = 0, α_0 = 1, q^0 ≡ 0;
    foreach (c_i, a) ∈ Z_h do N_0(c_i, a) = 0; end
    repeat
        Draw action a_t ~ π(·|Y_t) and observe reward R_t; generate the next state Y_{t+1} ~ p(·|Y_t, a_t);
        foreach i such that Y_t ∈ B_i do
            η = 1 / (N_k(c_i, a_t) + 1);
            if N_k(c_i, a_t) > 0 then
                (G_k q^k)(c_i, a_t) = (1 − η)(G_k q^k)(c_i, a_t) + η [R_t + γ max_{b ∈ A} (Γ_NN q^k)(Y_{t+1}, b)];
            else
                (G_k q^k)(c_i, a_t) = R_t + γ max_{b ∈ A} (Γ_NN q^k)(Y_{t+1}, b);
            end
            N_k(c_i, a_t) = N_k(c_i, a_t) + 1;
        end
        if min_{(c_i, a) ∈ Z_h} N_k(c_i, a) > 0 then
            foreach (c_i, a) ∈ Z_h do q^{k+1}(c_i, a) = (1 − α_k) q^k(c_i, a) + α_k (G_k q^k)(c_i, a); end
            k = k + 1; α_k = β / (β + k);
            foreach (c_i, a) ∈ Z_h do N_k(c_i, a) = 0; end
        end
        t = t + 1;
    until t ≥ T;
    return q̂ = q^k
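The pseudocode above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the environment interface `env_step`, the uniform exploration policy, and the 1-D even-grid discretization of [0, 1] are assumptions made here for concreteness.

```python
import numpy as np

def nnql(env_step, reset_state, n_centers, n_actions, gamma=0.9,
         T=10_000, beta=1.0, rng=None):
    """Sketch of Nearest-Neighbor Q-learning on a 1-D state space in [0, 1].

    `env_step(state, action) -> (reward, next_state)` is an assumed
    environment interface; centers c_i form an even grid (hypothetical
    choice standing in for the paper's bandwidth-h discretization).
    """
    rng = np.random.default_rng() if rng is None else rng
    centers = (np.arange(n_centers) + 0.5) / n_centers  # discretized states c_i
    q = np.zeros((n_centers, n_actions))                # current estimate q^k
    g = np.zeros((n_centers, n_actions))                # running target (G_k q^k)
    counts = np.zeros((n_centers, n_actions), dtype=int)  # N_k(c_i, a)
    alpha, k = 1.0, 0                                   # step size α_k, epoch k
    y = reset_state

    def nn(state):
        # 1-nearest-neighbor index: the Γ_NN interpolation degenerates to
        # a single center in this 1-NN sketch
        return int(np.argmin(np.abs(centers - state)))

    for t in range(T):
        a = int(rng.integers(n_actions))                # uniform exploration policy
        r, y_next = env_step(y, a)
        i = nn(y)
        eta = 1.0 / (counts[i, a] + 1)                  # η = 1/(N_k + 1)
        target = r + gamma * q[nn(y_next)].max()        # R_t + γ max_b q^k(NN(Y_{t+1}), b)
        g[i, a] = target if counts[i, a] == 0 else (1 - eta) * g[i, a] + eta * target
        counts[i, a] += 1
        if counts.min() > 0:                            # every (c_i, a) pair visited
            q = (1 - alpha) * q + alpha * g             # q^{k+1} = (1-α_k)q^k + α_k G_k q^k
            k += 1
            alpha = beta / (beta + k)                   # α_k = β/(β + k)
            counts[:] = 0                               # start a new epoch
        y = y_next
    return q
```

On a toy deterministic environment where action 1 always pays reward 1 and action 0 pays 0, the returned table assigns visibly higher Q-values to action 1, which matches the fixed-point intuition behind the averaged update.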
Open Source Code | No | The paper does not provide any information or links to open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and analyzes an algorithm's properties; it does not use, or provide access information for, any public dataset.
Dataset Splits | No | The paper is theoretical and reports no empirical experiments; no dataset split information (training, validation, test) is provided.
Hardware Specification | No | The paper is theoretical and describes no empirical experiments, so no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and describes no implementation that would require specific software dependencies with version numbers.
Experiment Setup | No | The paper focuses on mathematical analysis of an algorithm rather than empirical evaluation, so it includes no details about an experimental setup, hyperparameters, or training configuration.