Online learning in MDPs with linear function approximation and bandit feedback.

Authors: Gergely Neu, Julia Olkhovskaya

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our main contribution is developing an algorithm whose expected regret after T episodes is bounded by Õ(√(dHT)), where H is the number of steps in each episode and d is the dimensionality of the feature map. Our main contribution is designing a computationally efficient algorithm called ONLINE Q-REPS, and proving that, in the setting described above, its regret is at most O(√(dHT·D(μ*‖μ₀))). This section presents our main contributions: a new efficient algorithm for the setting described above, along with its performance guarantees. Our algorithm design is based on a reduction to online linear optimization... This section gives the proof of Theorem 1 by stating the main technical results as lemmas and putting them together to obtain the final bound.
Researcher Affiliation | Academia | Gergely Neu, Universitat Pompeu Fabra, Barcelona, Spain, gergely.neu@gmail.com; Julia Olkhovskaya, Universitat Pompeu Fabra, Barcelona, Spain, julia.olkhovskaya@gmail.com
Pseudocode | Yes | Algorithm 1 (ONLINE Q-REPS):
  Parameters: η, α > 0, exploration parameter γ ∈ (0, 1).
  Initialization: set θ̂_{1,h} = 0 for all h; compute Z₁.
  For t = 1, …, T, repeat:
    Draw Y_t ~ Ber(γ).
    For h = 1, …, H, do:
      Observe X_{t,h} and, for all a ∈ A(X_{t,h}), set
        π_{t,h}(a|X_{t,h}) = π_{0,h}(a|X_{t,h}) · exp(α(Q_{Z_t}(X_{t,h}, a) − V_{Z_t}(X_{t,h}))).
      If Y_t = 0, draw A_{t,h} ~ π_{t,h}(·|X_{t,h}); otherwise draw A_{t,h} ~ π_{0,h}(·|X_{t,h}).
      Observe the reward r_{t,h}(X_{t,h}, A_{t,h}).
    Compute θ̂_{t,1}, …, θ̂_{t,H−1} and Z_{t+1}.
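The inner loop of the pseudocode above (exponentially reweight a baseline policy by Q − V, then take an exploratory action with probability γ) can be sketched in Python. This is a minimal illustration, not the paper's implementation: `pi0`, `Q`, and `V` are hypothetical callables standing in for the baseline policy and the estimates Q_{Z_t}, V_{Z_t}, and the softmax is normalized explicitly rather than relying on V's log-partition form.

```python
import math
import random


def qreps_policy(x, actions, pi0, Q, V, alpha):
    """Reweighted policy pi_t(a|x) proportional to pi0(a|x) * exp(alpha * (Q(x,a) - V(x)))."""
    weights = [pi0(a, x) * math.exp(alpha * (Q(x, a) - V(x))) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]


def qreps_action(x, actions, pi0, Q, V, alpha, gamma, rng=random):
    """Gamma-mixed draw: with probability gamma (Y_t = 1) explore via pi0,
    otherwise (Y_t = 0) sample from the reweighted policy."""
    if rng.random() < gamma:
        probs = [pi0(a, x) for a in actions]
    else:
        probs = qreps_policy(x, actions, pi0, Q, V, alpha)
    return rng.choices(actions, weights=probs, k=1)[0]
```

For example, with a uniform baseline and a Q estimate that favors one action, the reweighted policy concentrates on that action as α grows, while γ > 0 preserves a floor of baseline exploration.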
Open Source Code | No | The paper does not contain any statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper focuses on theoretical algorithm design and regret analysis in an episodic Markov decision process. It does not describe experiments with specific datasets, public or otherwise.
Dataset Splits | No | The paper focuses on theoretical algorithm design and regret analysis. It does not mention training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical in nature, focusing on algorithm design and proofs. It does not describe any experimental setup or the hardware used for computations.
Software Dependencies | No | The paper is theoretical and describes an algorithm and its mathematical properties. It does not mention any specific software dependencies with version numbers required for implementation or experimentation.
Experiment Setup | No | The paper is theoretical, presenting an algorithm and its regret bounds. It does not include details about an experimental setup, such as hyperparameters or system-level training settings.