Minimax Regret Bounds for Reinforcement Learning

Authors: Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of timesteps. This result improves over the best previous known bound $\tilde{O}(H\sqrt{SAT})$ achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when $T \geq H^3 S^3 A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. (A worked check of the stated conditions appears after this table.)
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Mohammad Gheshlaghi Azar <mazar@google.com>.
Pseudocode | Yes | Algorithm 1 (UCBVI), Algorithm 2 (UCB-Q-values), Algorithm 3 (bonus_1), Algorithm 4 (bonus_2). (A hedged sketch of the UCBVI episode loop follows the table.)
Open Source Code | No | The paper does not provide any concrete access to source code (e.g., a repository link, an explicit code-release statement, or code in the supplementary material) for the described methodology.
Open Datasets | No | The paper is theoretical and focuses on regret bounds for reinforcement learning in finite-horizon MDPs. It does not report experiments on any publicly available datasets.
Dataset Splits | No | The paper is theoretical and reports no empirical experiments, so it does not specify any dataset splits (training, validation, or test) for reproduction.
Hardware Specification | No | The paper is theoretical and does not describe any hardware used for experiments.
Software Dependencies | No | The paper does not list any software dependencies or version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup such as hyperparameter values or training configurations.
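
As noted in the Research Type row, the quoted abstract claims that the bound $\tilde{O}(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T})$ collapses to $\tilde{O}(\sqrt{HSAT})$ when $T \geq H^3 S^3 A$ and $SA \geq H$. A short check, using only the two quoted conditions, shows that each lower-order term is dominated by the leading term:

```latex
% If T >= H^3 S^3 A, the second-order term is dominated:
H^2 S^2 A = \sqrt{H^4 S^4 A^2}
          \le \sqrt{H^4 S^4 A^2 \cdot \tfrac{T}{H^3 S^3 A}}
          = \sqrt{HSAT}.

% If SA >= H, the last term is dominated as well:
H\sqrt{T} = \sqrt{H^2 T} \le \sqrt{H \cdot SA \cdot T} = \sqrt{HSAT}.
```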
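
The Pseudocode row lists UCBVI and its bonus subroutines. Below is a minimal Python sketch of optimistic value iteration in the spirit of Algorithm 1 (UCBVI), using a simple Hoeffding-style exploration bonus. It is an illustrative assumption, not the paper's exact algorithm: the function name `ucbvi_sketch`, the bonus constant, and the bookkeeping are simplified relative to the paper's bonus_1/bonus_2 terms.

```python
import numpy as np

def ucbvi_sketch(P, R, H, K, num_states, num_actions, delta=0.1, seed=0):
    """Minimal sketch of optimistic value iteration with a Hoeffding-style bonus.

    P: true transition kernel, shape (S, A, S) -- used only to simulate the MDP.
    R: true mean rewards in [0, 1], shape (S, A) -- used only to simulate the MDP.
    H: horizon, K: number of episodes.
    Returns the list of realized episode returns.
    """
    rng = np.random.default_rng(seed)
    S, A = num_states, num_actions
    counts = np.zeros((S, A))            # visit counts N(s, a)
    trans_counts = np.zeros((S, A, S))   # transition counts N(s, a, s')
    reward_sums = np.zeros((S, A))       # accumulated rewards per (s, a)
    L = np.log(5 * S * A * max(H * K, 2) / delta)  # logarithmic factor in the bonus
    returns = []

    for _ in range(K):
        # Empirical model from the data gathered so far.
        N = np.maximum(counts, 1)
        P_hat = trans_counts / N[:, :, None]
        P_hat[counts == 0] = 1.0 / S     # uniform guess for unvisited pairs
        R_hat = reward_sums / N
        bonus = 7 * H * np.sqrt(L / N)   # Hoeffding-style bonus, simplified vs. bonus_1

        # Optimistic backward induction; Q-values are clipped at H.
        Q = np.zeros((H + 1, S, A))
        V = np.zeros((H + 1, S))
        for h in range(H - 1, -1, -1):
            Q[h] = np.minimum(H, R_hat + bonus + P_hat @ V[h + 1])
            V[h] = Q[h].max(axis=1)

        # Run one episode greedily with respect to the optimistic Q-values.
        s, ep_return = 0, 0.0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            r = float(rng.random() < R[s, a])        # Bernoulli reward with mean R[s, a]
            s_next = int(rng.choice(S, p=P[s, a]))
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            reward_sums[s, a] += r
            ep_return += r
            s = s_next
        returns.append(ep_return)
    return returns
```

The two structural choices in the sketch mirror the quoted result: optimistic Q-values are clipped at H rather than at H per-step sums of larger constants, and the bonus shrinks as the inverse square root of the visit count, which is what yields a square-root-in-T regret dependence.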