Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Minimax Regret Bounds for Reinforcement Learning
Authors: Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
ICML 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where H is the time horizon, S the number of states, A the number of actions and T the number of timesteps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when $T \ge H^3S^3A$ and $SA \ge H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. |
| Researcher Affiliation | Industry | ¹DeepMind, London, UK. Correspondence to: Mohammad Gheshlaghi Azar <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 UCBVI, Algorithm 2 UCB-Q-values, Algorithm 3 bonus_1, Algorithm 4 bonus_2 |
| Open Source Code | No | The paper does not provide any concrete access to source code (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | No | The paper is theoretical and focuses on regret bounds for reinforcement learning in finite horizon MDPs. It does not mention using specific, publicly available datasets for training experiments. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical experiments, therefore it does not specify any dataset splits (training, validation, or test) for reproduction. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameter values or training configurations. |
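
The regime conditions quoted in the Research Type row can be checked term by term. The short derivation below is a sketch, not taken from the paper's proof: it assumes S, A, H and T are all positive and ignores the logarithmic factors hidden by the tilde notation. It shows why the two lower-order terms are each bounded by the leading term, so the sum stays within a constant factor of it.

```latex
% Sketch under stated assumptions: S, A, H, T > 0; log factors ignored.
\begin{align*}
T \ge H^3 S^3 A
  &\iff H^4 S^4 A^2 \le HSAT
  \iff H^2 S^2 A \le \sqrt{HSAT}, \\
SA \ge H
  &\iff H^2 T \le HSAT
  \iff H\sqrt{T} \le \sqrt{HSAT}, \\
\text{hence}\quad
\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T}
  &\le 3\sqrt{HSAT}.
\end{align*}
```

Under both conditions the bound is therefore of order $\sqrt{HSAT}$ up to logarithmic factors, which is the rate the quoted abstract compares against the lower bound.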