No-Regret Exploration in Goal-Oriented Reinforcement Learning
Authors: Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as Õ(DS√(ADK)) after K episodes for any unknown SSP with S states, A actions, positive costs and SSP-diameter D... Finally, we support our theoretical findings with experiments in App. J. |
| Researcher Affiliation | Collaboration | 1 Facebook AI Research, Paris, France; 2 SequeL team, Inria Lille - Nord Europe, France. |
| Pseudocode | Yes | Algorithm 1 (UC-SSP algorithm) and Algorithm 2 (EVI-SSP) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper describes using custom 'gridworld environments' for experiments, but it does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We report our experimental results in App. J. for two different environments described in Fig. 2 (a) and Fig. 2 (b). For the parameters of UC-SSP, we set the confidence δ = 0.05 and use cmin = 1 and cmax = 10 for the general SSP case, as well as cmin = cmax = 1 for the uniform-cost SSP case. We average the regret over 50 independent runs and plot the average regret with 95% confidence intervals. For the discount factor γ of UCRL2 and UCRL2B, we choose γ = 0.95. For UCBVI, we use H = 100 as the fixed horizon. |
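The experiment setup row above lists every hyperparameter the paper reports for its App. J experiments. The following is a minimal sketch collecting those values in one place, assuming a simple configuration object; the `UCSSPConfig` name and its fields are hypothetical conveniences, not part of any released implementation (the paper links none).

```python
# Hypothetical configuration sketch for the experiment setup reported in App. J.
# Values are taken directly from the paper; the class and field names are
# illustrative assumptions, not the authors' code.
from dataclasses import dataclass

@dataclass
class UCSSPConfig:
    delta: float = 0.05       # confidence parameter for UC-SSP
    c_min: float = 1.0        # minimum per-step cost (general SSP case)
    c_max: float = 10.0       # maximum per-step cost (general SSP case)
    n_runs: int = 50          # independent runs averaged for the regret curves
    ci_level: float = 0.95    # confidence intervals plotted around the mean regret
    gamma_ucrl: float = 0.95  # discount factor used for the UCRL2 / UCRL2B baselines
    horizon_ucbvi: int = 100  # fixed horizon H used for the UCBVI baseline

# The uniform-cost SSP case sets c_min = c_max = 1, per the paper.
uniform_cost_cfg = UCSSPConfig(c_min=1.0, c_max=1.0)
```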