Q-learning with Nearest Neighbors
Authors: Devavrat Shah, Qiaomin Xie
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with covering time L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε³(1 − γ)⁷)) samples. ...Our analysis consists of viewing our algorithm as a special case of a general biased stochastic approximation procedure, for which we establish non-asymptotic convergence guarantees. |
| Researcher Affiliation | Academia | Devavrat Shah Massachusetts Institute of Technology devavrat@mit.edu Qiaomin Xie Massachusetts Institute of Technology qxie@mit.edu Both authors are affiliated with Laboratory for Information and Decision Systems (LIDS). DS is with the Department of EECS as well as Statistics and Data Science Center at MIT. |
| Pseudocode | Yes | Policy 1: Nearest-Neighbor Q-learning. Input: exploration policy π, discount factor γ, number of steps T, bandwidth parameter h, and initial state Y0. Construct discretized state space Xh; initialize t = k = 0, α0 = 1, q0 ≡ 0; for each (ci, a) ∈ Zh, set N0(ci, a) = 0. Repeat: draw action at ∼ π(·\|Yt) and observe reward Rt; generate the next state Yt+1 ∼ p(·\|Yt, at). For each i such that Yt ∈ Bi: set αN = 1/(Nk(ci, at) + 1); if Nk(ci, at) > 0 then (Gkqk)(ci, at) = (1 − αN)(Gkqk)(ci, at) + αN[Rt + γ max_{b∈A}(ΓNNqk)(Yt+1, b)], else (Gkqk)(ci, at) = Rt + γ max_{b∈A}(ΓNNqk)(Yt+1, b); then set Nk(ci, at) = Nk(ci, at) + 1. If min_{(ci,a)∈Zh} Nk(ci, a) > 0: for each (ci, a) ∈ Zh, set qk+1(ci, a) = (1 − αk)qk(ci, a) + αk(Gkqk)(ci, a); set k = k + 1 and αk = β/(β + k); reset Nk(ci, a) = 0 for all (ci, a) ∈ Zh. Set t = t + 1; until t ≥ T. Return q̂ = qk. |
| Open Source Code | No | The paper does not provide any information or links to open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and analyzes an algorithm's properties; it does not use or provide access information for any specific public dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not report on empirical experiments with dataset splits. No specific dataset split information (training, validation, test) is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any empirical experiments, thus no specific hardware specifications for running experiments are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or an implementation that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical analysis of an algorithm rather than empirical evaluation. Therefore, it does not include details about an experimental setup, hyperparameters, or training configurations. |
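Since the paper releases no code (see the Open Source Code row), the batching structure of Policy 1 can be sketched in Python on a toy problem. The 1-D MDP, the five-center discretization, the uniform exploration policy, and all variable names below are illustrative assumptions, not the authors' implementation; the sketch only mirrors the pseudocode's two loops: a running average (Gkqk) per cell until every (ci, a) pair is visited, then one committed q-update with step size αk = β/(β + k).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous-state MDP on [0, 1] (an assumption for illustration):
# action 0 drifts left, action 1 drifts right; reward is the next state.
def step(y, a):
    y_next = np.clip(y + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), 0.0, 1.0)
    return y_next, y_next  # (next state Y_{t+1}, reward R_t)

# Discretized centers c_i; nearest-neighbor projection Gamma_NN maps a
# continuous state to its closest center.
centers = np.linspace(0.0, 1.0, 5)
def nn(y):
    return int(np.argmin(np.abs(centers - y)))

n_a, gamma, beta, T = 2, 0.9, 10.0, 20000
q = np.zeros((len(centers), n_a))       # q_k
g = np.zeros_like(q)                    # running average (G_k q_k)
counts = np.zeros_like(q, dtype=int)    # visit counts N_k(c_i, a)
k, y = 0, 0.5

for t in range(T):
    a = int(rng.integers(n_a))          # uniform exploration policy pi
    y_next, r = step(y, a)
    i = nn(y)
    # Empirical Bellman target via the nearest-neighbor value of Y_{t+1}.
    target = r + gamma * q[nn(y_next)].max()
    alpha_n = 1.0 / (counts[i, a] + 1)
    # First visit in this epoch overwrites; later visits update the average.
    g[i, a] = target if counts[i, a] == 0 else (1 - alpha_n) * g[i, a] + alpha_n * target
    counts[i, a] += 1
    # Once every (c_i, a) pair has been visited, commit one q-iteration.
    if counts.min() > 0:
        alpha_k = beta / (beta + k)     # alpha_0 = 1, then decaying
        q = (1 - alpha_k) * q + alpha_k * g
        k += 1
        counts[:] = 0                   # start a new averaging epoch
    y = y_next

print(q[nn(0.5)].argmax())  # greedy action at state 0.5
```

Because the reward equals the next state, drifting right (action 1) should dominate at interior states once several q-iterations have been committed, which gives a quick sanity check on the epoch logic.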