Online learning in MDPs with linear function approximation and bandit feedback.
Authors: Gergely Neu, Julia Olkhovskaya
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our main contribution is developing an algorithm whose expected regret after T episodes is bounded by Õ(√(dHT)), where H is the number of steps in each episode and d is the dimensionality of the feature map. Our main contribution is designing a computationally efficient algorithm called ONLINE Q-REPS, and prove that in the setting described above, its regret is at most O(√(dHT · D(µ∗‖µ0))). This section presents our main contributions: a new efficient algorithm for the setting described above, along with its performance guarantees. Our algorithm design is based on a reduction to online linear optimization... This section gives the proof of Theorem 1 by stating the main technical results as lemmas and putting them together to obtain the final bound. |
| Researcher Affiliation | Academia | Gergely Neu, Universitat Pompeu Fabra, Barcelona, Spain, gergely.neu@gmail.com; Julia Olkhovskaya, Universitat Pompeu Fabra, Barcelona, Spain, julia.olkhovskaya@gmail.com |
| Pseudocode | Yes | Algorithm 1 ONLINE Q-REPS. Parameters: η, α > 0, exploration parameter γ ∈ (0, 1). Initialization: set θ̂₁,ₕ = 0 for all h, compute Z₁. For t = 1, . . . , T, repeat: draw Yₜ ∼ Ber(γ). For h = 1, . . . , H, do: observe Xₜ,ₕ and, for all a ∈ A(Xₜ,ₕ), set πₜ,ₕ(a\|Xₜ,ₕ) = π₀,ₕ(a\|Xₜ,ₕ) · exp(α(Q_{Zₜ}(Xₜ,ₕ, a) − V_{Zₜ}(Xₜ,ₕ))); if Yₜ = 0, draw Aₜ,ₕ ∼ πₜ,ₕ(·\|Xₜ,ₕ), otherwise draw Aₜ,ₕ ∼ π₀,ₕ(·\|Xₜ,ₕ); observe the reward rₜ,ₕ(Xₜ,ₕ, Aₜ,ₕ). Compute θ̂ₜ,₁, . . . , θ̂ₜ,H−1 and Zₜ₊₁. |
| Open Source Code | No | The paper does not contain any statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper focuses on theoretical algorithm design and regret analysis in an episodic Markov decision process. It does not describe experiments with specific datasets, public or otherwise. |
| Dataset Splits | No | The paper focuses on theoretical algorithm design and regret analysis. It does not mention training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical in nature, focusing on algorithm design and proofs. It does not describe any experimental setup or the hardware used for computations. |
| Software Dependencies | No | The paper is theoretical and describes an algorithm and its mathematical properties. It does not mention any specific software dependencies with version numbers required for implementation or experimentation. |
| Experiment Setup | No | The paper is theoretical, presenting an algorithm and its regret bounds. It does not include details about an experimental setup, such as hyperparameters or system-level training settings. |
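Although the paper reports no reference implementation, the per-step policy update and exploration mixing in Algorithm 1 can be sketched directly from the pseudocode above. The sketch below assumes the Q_Z and V_Z values for the current state have already been computed (the paper's main computational work, omitted here), and the names `qreps_policy` and `act` are hypothetical helpers, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def qreps_policy(q_values, v_value, pi0, alpha):
    """Multiplicative-weights policy of ONLINE Q-REPS:
    pi_t(a|x) proportional to pi_0(a|x) * exp(alpha * (Q_Z(x, a) - V_Z(x)))."""
    w = pi0 * np.exp(alpha * (q_values - v_value))
    return w / w.sum()

def act(q_values, v_value, pi0, alpha, gamma):
    """One action draw of Algorithm 1: with probability gamma (Y_t = 1)
    explore via the base policy pi_0, otherwise follow pi_t."""
    pi = qreps_policy(q_values, v_value, pi0, alpha)
    probs = pi0 if rng.random() < gamma else pi
    return rng.choice(len(probs), p=probs), pi
```

The exploration draw Yₜ ∼ Ber(γ) is made once per episode in the pseudocode; here it is folded into the single-step helper for brevity.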