reproducibilityindex.ai

Regret Bounds for Learning State Representations in Reinforcement Learning

Authors: Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper we introduce UCB-MS, an optimistic elimination algorithm that performs efﬁcient exploration of the representations. For this algorithm we prove a regret bound of order e O( p\|Φ\|T). Our algorithm as well as our results are based on and generalize the regret bounds achieved for the UCRL2 algorithm in [9]. In particular, if Φ consists of a single Markov model we obtain the same regret bound as for UCRL2. UCB-MS employs optimism to choose a model from Φ. To avoid suffering too large regret from choosing a non-Markov model, the collected rewards are compared to regret bounds that are known to hold for Markov models. If a model fails to give sufﬁciently high reward, it is eliminated. On the other hand, UCB-MS is happy to employ a non-Markov model as long as it gives as much reward as it would be expected from a corresponding Markov model.
Researcher Affiliation	Collaboration	Ronald Ortner Montanuniversität Leoben rortner@unileoben.ac.at Matteo Pirotta Facebook AI Research pirotta@fb.com Ronan Fruit Sequel Team INRIA Lille ronan.fruit@inria.fr Alessandro Lazaric Facebook AI Research lazaric@fb.com Odalric-Ambrym Maillard Sequel Team INRIA Lille odalric.maillard@inria.fr
Pseudocode	Yes	Algorithm 1 UCB-Model Selection (UCB-MS)
Open Source Code	No	The paper does not provide any links to open-source code for the described methodology. It discusses the algorithm and its theoretical bounds.
Open Datasets	No	The paper is theoretical and focuses on algorithm design and regret bounds in reinforcement learning. It does not describe experiments on specific datasets, so no dataset information is provided.
Dataset Splits	No	The paper is theoretical and focuses on algorithm design and regret bounds in reinforcement learning. It does not describe experiments or dataset splits.
Hardware Specification	No	The paper is theoretical and does not describe any experimental hardware specifications.
Software Dependencies	No	The paper is theoretical and does not mention specific software dependencies with version numbers.
Experiment Setup	No	The paper is theoretical, presenting an algorithm and its regret bounds. It does not describe an experimental setup with hyperparameters or system-level training settings.