Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning

Authors: K. Lakshmanan, Ronald Ortner, Daniil Ryabko

ICML 2015

Reproducibility Variable | Result | LLM Response

Research Type | Theoretical | Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition functions. The paper improves upon earlier results by using non-parametric kernel density estimation to estimate the transition probability distributions, obtaining regret bounds that depend on the smoothness of those distributions. In particular, under the assumption that the transition probability functions are smoothly differentiable, the regret is shown to be O(T^(2/3)) asymptotically for reinforcement learning in a 1-dimensional state space. Improved regret bounds are also derived for higher-dimensional state spaces.
Researcher Affiliation | Academia | K. Lakshmanan (LKSHMNAN.K@GMAIL.COM), Montanuniversität Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria; Ronald Ortner (RORTNER@UNILEOBEN.AC.AT), Montanuniversität Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria; Daniil Ryabko (DANIIL@RYABKO.NET), INRIA Lille Nord Europe, 40 Avenue Halley, 59650 Villeneuve d'Ascq, France.
Pseudocode | Yes | Algorithm 1, the UCCRL-Kernel Density (UCCRL-KD) algorithm, is given in pseudocode.
Open Source Code | No | The paper contains no statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and operates within the state space [0, 1] for its Markov decision process (MDP) model. It does not use or provide access to any dataset.
Dataset Splits | No | As a theoretical paper, it involves no empirical experiments and hence no training, validation, or test splits.
Hardware Specification | No | The paper focuses on mathematical derivations and algorithm design; it describes no empirical experiments, so no hardware specifications are mentioned.
Software Dependencies | No | The paper describes the UCCRL-KD algorithm and its regret bounds. It does not specify any software dependencies or version numbers needed to replicate empirical results.
Experiment Setup | No | The paper derives regret bounds and does not include details about an experimental setup, hyperparameters, or training configurations.
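The kernel density estimation step at the heart of UCCRL-KD can be illustrated with a short sketch. This is a generic Nadaraya-Watson-style conditional density estimate for transitions on the state space [0, 1], not the paper's exact estimator; the function name, the Gaussian kernel, and the fixed bandwidth are assumptions made for the example.

```python
import numpy as np

def kde_transition_density(transitions, s_query, s_next_grid, bandwidth=0.1):
    """Estimate the transition density p(s' | s_query) from observed
    (state, next_state) pairs in [0, 1] via a Gaussian kernel smoother
    (Nadaraya-Watson conditional density estimate).

    transitions: array of shape (n, 2) holding observed (s, s') pairs.
    Returns the estimated density evaluated at each point of s_next_grid.
    """
    transitions = np.asarray(transitions, dtype=float)
    s_next_grid = np.asarray(s_next_grid, dtype=float)
    s, s_next = transitions[:, 0], transitions[:, 1]

    # Weight each sample by how close its starting state is to s_query.
    w = np.exp(-0.5 * ((s - s_query) / bandwidth) ** 2)
    w_sum = w.sum()
    if w_sum == 0.0:
        # No nearby data: fall back to the uniform density on [0, 1].
        return np.ones_like(s_next_grid)

    # Kernel-smoothed density over next states, one value per grid point.
    diffs = (s_next_grid[:, None] - s_next[None, :]) / bandwidth
    k = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return (k * w[None, :]).sum(axis=1) / w_sum
```

The bandwidth controls the bias-variance trade-off that the paper's analysis captures through the smoothness of the transition distributions: smoother true densities tolerate wider kernels, which is what drives the improved regret rates.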