Regret Bounds for Learning State Representations in Reinforcement Learning

Authors: Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper we introduce UCB-MS, an optimistic elimination algorithm that performs efficient exploration of the representations. For this algorithm we prove a regret bound of order Õ(√(|Φ|T)). Our algorithm as well as our results are based on and generalize the regret bounds achieved for the UCRL2 algorithm in [9]. In particular, if Φ consists of a single Markov model, we obtain the same regret bound as for UCRL2. UCB-MS employs optimism to choose a model from Φ. To avoid suffering too large a regret from choosing a non-Markov model, the collected rewards are compared to regret bounds that are known to hold for Markov models. If a model fails to give sufficiently high reward, it is eliminated. On the other hand, UCB-MS is content to keep a non-Markov model as long as it yields as much reward as would be expected from a corresponding Markov model. (A schematic sketch of this elimination scheme is given below the table.)
Researcher Affiliation | Collaboration | Ronald Ortner (Montanuniversität Leoben, rortner@unileoben.ac.at); Matteo Pirotta (Facebook AI Research, pirotta@fb.com); Ronan Fruit (SequeL Team, INRIA Lille, ronan.fruit@inria.fr); Alessandro Lazaric (Facebook AI Research, lazaric@fb.com); Odalric-Ambrym Maillard (SequeL Team, INRIA Lille, odalric.maillard@inria.fr)
Pseudocode | Yes | Algorithm 1: UCB-Model Selection (UCB-MS)
Open Source Code | No | The paper does not provide any links to open-source code for the described methodology; it discusses the algorithm and its theoretical bounds.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and regret bounds in reinforcement learning. It does not describe experiments on specific datasets, so no dataset information is provided.
Dataset Splits | No | The paper is theoretical and does not describe experiments or dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware specifications.
Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical, presenting an algorithm and its regret bounds; it does not describe an experimental setup with hyperparameters or system-level training settings.
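
The sketch below is a minimal, hypothetical Python rendering of the elimination scheme summarized in the Research Type row: optimistically select a representation from Φ, run a UCRL2-style episode with it, and eliminate it if its collected reward falls short of what the UCRL2 regret bound guarantees for a Markov model. It is not the authors' Algorithm 1; the helpers `optimistic_gain`, `ucrl2_episode`, and `markov_regret_bound` are assumed placeholders for UCRL2-style machinery (extended value iteration, confidence sets, and the known regret bound).

```python
def ucb_ms(models, horizon, delta,
           optimistic_gain, ucrl2_episode, markov_regret_bound):
    """Schematic UCB-MS-style model selection over a set of candidate
    state representations `models` (the set Phi in the paper).

    The three callables are hypothetical stand-ins:
      optimistic_gain(m, t, delta)      -> optimistic average-reward estimate
      ucrl2_episode(m, t, delta)        -> (reward collected, steps used)
      markov_regret_bound(m, n, delta)  -> regret guaranteed for a Markov
                                           model over n steps
    """
    active = set(models)      # representations not yet eliminated
    total_reward = 0.0
    t = 0                     # total number of steps taken so far

    while t < horizon and active:
        # Optimism: pick the active representation with the largest
        # optimistic (UCRL2-style) average-reward estimate.
        phi = max(active, key=lambda m: optimistic_gain(m, t, delta))
        rho_plus = optimistic_gain(phi, t, delta)

        # Run one UCRL2-style episode under representation phi.
        episode_reward, episode_steps = ucrl2_episode(phi, t, delta)
        total_reward += episode_reward
        t += episode_steps

        # Elimination test: with high probability, a true Markov model keeps
        # its collected reward within the known UCRL2 regret bound of its
        # optimistic estimate. If phi falls short, it is eliminated.
        shortfall = episode_steps * rho_plus - episode_reward
        if shortfall > markov_regret_bound(phi, episode_steps, delta):
            active.discard(phi)

    return total_reward
```

Under this reading, a representation that is not Markov can still be kept indefinitely, provided it earns as much reward as a Markov model would be expected to; only underperforming representations are discarded.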