Regret Bounds for Learning State Representations in Reinforcement Learning
Authors: Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper we introduce UCB-MS, an optimistic elimination algorithm that performs efficient exploration of the representations. For this algorithm we prove a regret bound of order $\tilde{O}(\sqrt{|\Phi| T})$. Our algorithm as well as our results are based on, and generalize, the regret bounds achieved for the UCRL2 algorithm in [9]. In particular, if Φ consists of a single Markov model, we obtain the same regret bound as for UCRL2. UCB-MS employs optimism to choose a model from Φ. To avoid suffering too large a regret from choosing a non-Markov model, the collected rewards are compared to regret bounds that are known to hold for Markov models. If a model fails to give sufficiently high reward, it is eliminated. On the other hand, UCB-MS is happy to employ a non-Markov model as long as it gives as much reward as would be expected from a corresponding Markov model. (A hedged sketch of this elimination loop appears after the table.) |
| Researcher Affiliation | Collaboration | Ronald Ortner (Montanuniversität Leoben, rortner@unileoben.ac.at); Matteo Pirotta (Facebook AI Research, pirotta@fb.com); Ronan Fruit (SequeL Team, INRIA Lille, ronan.fruit@inria.fr); Alessandro Lazaric (Facebook AI Research, lazaric@fb.com); Odalric-Ambrym Maillard (SequeL Team, INRIA Lille, odalric.maillard@inria.fr) |
| Pseudocode | Yes | Algorithm 1 UCB-Model Selection (UCB-MS) |
| Open Source Code | No | The paper does not provide any links to open-source code for the described methodology. It discusses the algorithm and its theoretical bounds. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and regret bounds in reinforcement learning. It does not describe experiments on specific datasets, so no dataset information is provided. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments, so no train/validation/test splits are reported. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware specifications. |
| Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical, presenting an algorithm and its regret bounds. It does not describe an experimental setup with hyperparameters or system-level training settings. |
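To make the elimination mechanism quoted in the Research Type row concrete, below is a minimal Python sketch of a UCB-MS-style loop. This is not the authors' implementation: the callbacks `optimistic_value` and `run_episode`, the episode structure, and the confidence term `slack` are all placeholder assumptions standing in for the paper's UCRL2-style machinery (Algorithm 1 in the paper).

```python
import math

def ucb_ms(models, horizon, delta, run_episode, optimistic_value):
    """Illustrative sketch of a UCB-MS-style elimination loop (not the authors' code).

    models           -- the candidate set Phi of state-representation models
    horizon          -- total number of steps T
    delta            -- confidence parameter for the elimination test
    run_episode      -- hypothetical callback: runs the optimistic policy of a
                        model for one episode, returns (collected_reward, steps)
    optimistic_value -- hypothetical callback: UCRL2-style optimistic
                        per-step reward estimate for a model, given data so far
    """
    active = set(models)
    t = 0
    while t < horizon and active:
        # Optimism: pick the active model with the highest optimistic value.
        phi = max(active, key=optimistic_value)
        reward, steps = run_episode(phi)
        t += steps
        # Elimination test: a true Markov model would, with high probability,
        # collect at least its optimistic reward minus a UCRL2-style
        # confidence term of order sqrt(steps * log(1/delta)).
        # (The constant and exact form of the term are placeholders.)
        slack = math.sqrt(steps * math.log(1.0 / delta))
        if reward < steps * optimistic_value(phi) - slack:
            active.discard(phi)  # model failed its own Markov regret bound
    return active
```

The design point mirrors the quoted passage: a (possibly non-Markov) model is kept for as long as it earns reward consistent with the regret guarantee a Markov model would satisfy, and is eliminated the first time it falls short.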