Online learning in MDPs with linear function approximation and bandit feedback.

Authors: Gergely Neu, Julia Olkhovskaya

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our main contribution is developing an algorithm whose expected regret after T episodes is bounded by Õ(√(dHT)), where H is the number of steps in each episode and d is the dimensionality of the feature map. Our main contribution is designing a computationally efficient algorithm called ONLINE Q-REPS, and proving that, in the setting described above, its regret is at most O(√(dHT·D(μ*‖μ₀))). This section presents our main contributions: a new efficient algorithm for the setting described above, along with its performance guarantees. Our algorithm design is based on a reduction to online linear optimization... This section gives the proof of Theorem 1 by stating the main technical results as lemmas and putting them together to obtain the final bound.
Researcher Affiliation | Academia | Gergely Neu, Universitat Pompeu Fabra, Barcelona, Spain, gergely.neu@gmail.com; Julia Olkhovskaya, Universitat Pompeu Fabra, Barcelona, Spain, julia.olkhovskaya@gmail.com
Pseudocode | Yes | Algorithm 1 (ONLINE Q-REPS):
  Parameters: η, α > 0, exploration parameter γ ∈ (0, 1).
  Initialization: set θ̂_{1,h} = 0 for all h; compute Z₁.
  For t = 1, …, T, repeat:
    Draw Y_t ~ Ber(γ).
    For h = 1, …, H, do:
      Observe X_{t,h} and, for all a ∈ A(X_{t,h}), set
        π_{t,h}(a|X_{t,h}) = π_{0,h}(a|X_{t,h}) · exp(α(Q_{Z_t}(X_{t,h}, a) − V_{Z_t}(X_{t,h}))).
      If Y_t = 0, draw A_{t,h} ~ π_{t,h}(·|X_{t,h}); otherwise draw A_{t,h} ~ π_{0,h}(·|X_{t,h}).
      Observe the reward r_{t,h}(X_{t,h}, A_{t,h}).
    Compute θ̂_{t,1}, …, θ̂_{t,H−1} and Z_{t+1}.
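The inner loop of the pseudocode above (exponentially reweight a baseline policy by Q − V, then take an exploratory action with probability γ) can be sketched in Python. This is a minimal illustration, not the paper's implementation: `pi0`, `Q`, and `V` are hypothetical callables standing in for the baseline policy and the estimates Q_{Z_t}, V_{Z_t}, and the softmax is normalized explicitly rather than relying on V's log-partition form.

```python
import math
import random


def qreps_policy(x, actions, pi0, Q, V, alpha):
    """Reweighted policy pi_t(a|x) proportional to pi0(a|x) * exp(alpha * (Q(x,a) - V(x)))."""
    weights = [pi0(a, x) * math.exp(alpha * (Q(x, a) - V(x))) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]


def qreps_action(x, actions, pi0, Q, V, alpha, gamma, rng=random):
    """Gamma-mixed draw: with probability gamma (Y_t = 1) explore via pi0,
    otherwise (Y_t = 0) sample from the reweighted policy."""
    if rng.random() < gamma:
        probs = [pi0(a, x) for a in actions]
    else:
        probs = qreps_policy(x, actions, pi0, Q, V, alpha)
    return rng.choices(actions, weights=probs, k=1)[0]
```

For example, with a uniform baseline and a Q estimate that favors one action, the reweighted policy concentrates on that action as α grows, while γ > 0 preserves a floor of baseline exploration.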
Open Source Code | No | The paper does not contain any statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper focuses on theoretical algorithm design and regret analysis in an episodic Markov decision process. It does not describe experiments with specific datasets, public or otherwise.
Dataset Splits | No | The paper focuses on theoretical algorithm design and regret analysis. It does not mention training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical in nature, focusing on algorithm design and proofs. It does not describe any experimental setup or the hardware used for computations.
Software Dependencies | No | The paper is theoretical and describes an algorithm and its mathematical properties. It does not mention any specific software dependencies with version numbers required for implementation or experimentation.
Experiment Setup | No | The paper is theoretical, presenting an algorithm and its regret bounds. It does not include details about an experimental setup, such as hyperparameters or system-level training settings.