Minimax Regret Bounds for Reinforcement Learning
Authors: Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where H is the time horizon, S the number of states, A the number of actions and T the number of timesteps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. (A short check of the stated regime appears after the table.) |
| Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Mohammad Gheshlaghi Azar <mazar@google.com>. |
| Pseudocode | Yes | Algorithm 1 UCBVI, Algorithm 2 UCB-Q-values, Algorithm 3 bonus_1, Algorithm 4 bonus_2 (an illustrative sketch of the UCBVI update appears after the table) |
| Open Source Code | No | The paper does not provide any concrete access to source code (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | No | The paper is theoretical and focuses on regret bounds for reinforcement learning in finite horizon MDPs. It does not mention using specific, publicly available datasets for training experiments. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical experiments, therefore it does not specify any dataset splits (training, validation, or test) for reproduction. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameter values or training configurations. |
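The regime condition quoted in the Research Type row can be verified directly from the three terms of the bound. The following is a short check using only the quantities stated in the abstract.

```latex
% When T >= H^3 S^3 A and SA >= H, the lower-order terms are dominated
% by the leading sqrt(HSAT) term:
\[
T \ge H^3 S^3 A \;\Longrightarrow\; H^2 S^2 A = \sqrt{H^4 S^4 A^2} \le \sqrt{HSAT},
\qquad
SA \ge H \;\Longrightarrow\; H\sqrt{T} = \sqrt{H^2 T} \le \sqrt{HSAT},
\]
\[
\text{so that}\quad
\tilde{O}\!\left(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T}\right)
= \tilde{O}\!\left(\sqrt{HSAT}\right).
\]
```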
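Since the paper releases no code, the sketch below is only a minimal reconstruction of optimistic value iteration with a Hoeffding-style bonus in the spirit of UCBVI and bonus_1, not the authors' implementation. The environment interface (`env.reset`, `env.step`), the constant and log factor in the bonus, and all variable names are assumptions made for this example.

```python
"""Minimal sketch of UCBVI-style optimistic value iteration.

Illustrative reconstruction only; the bonus constant/log factor, the
environment interface and all names are assumptions.  Rewards are
assumed to lie in [0, 1].
"""
import numpy as np


def ucbvi(env, S, A, H, num_episodes, delta=0.1):
    """Run `num_episodes` episodes of optimistic planning + greedy acting.

    `env` is assumed to expose reset() -> state and
    step(state, action) -> (next_state, reward) for a finite-horizon MDP
    with S states, A actions and horizon H.
    """
    counts = np.zeros((S, A), dtype=np.int64)            # N(x, a)
    trans_counts = np.zeros((S, A, S), dtype=np.int64)   # N(x, a, x')
    reward_sum = np.zeros((S, A))                        # observed rewards
    log_term = np.log(5 * S * A * max(num_episodes * H, 2) / delta)

    for _ in range(num_episodes):
        # Empirical model from the data gathered so far.
        visited = np.maximum(counts, 1)
        p_hat = trans_counts / visited[:, :, None]
        r_hat = reward_sum / visited

        # Hoeffding-style exploration bonus (in the spirit of bonus_1).
        bonus = 7 * H * log_term / np.sqrt(visited)

        # Optimistic backward induction; V[H + 1] is identically zero.
        Q = np.zeros((H + 2, S, A))
        V = np.zeros((H + 2, S))
        for h in range(H, 0, -1):
            Q[h] = np.minimum(r_hat + p_hat @ V[h + 1] + bonus, H)
            V[h] = Q[h].max(axis=1)

        # Act greedily with respect to the optimistic Q-values.
        x = env.reset()
        for h in range(1, H + 1):
            a = int(np.argmax(Q[h, x]))
            x_next, r = env.step(x, a)
            counts[x, a] += 1
            trans_counts[x, a, x_next] += 1
            reward_sum[x, a] += r
            x = x_next

    return counts
```

The clipping of the optimistic Q-values at H reflects the fact that, with rewards in [0, 1] and horizon H, no value can exceed H; this is what keeps the bonus from inflating values for rarely visited state-action pairs beyond their trivial upper bound.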