Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we report numerical simulations supporting our theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span."
Researcher Affiliation | Collaboration | "1 SequeL Team, INRIA Lille, France; 2 Facebook AI Research, Paris, France; 3 Montanuniversität Leoben, Austria."
Pseudocode | Yes | "Figure 1. The general structure of optimistic algorithms for RL." and "Figure 3. Algorithm SCOPT."
Open Source Code | Yes | "The code is available on GitHub."
Open Datasets | No | The paper uses a "simple but descriptive three-state domain" and specifies reward distributions (Bernoulli), but it does not provide concrete access information (a specific link, DOI, repository name, formal citation with authors/year, or reference to an established benchmark dataset) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide hardware details (exact GPU/CPU models, processor types and speeds, memory amounts, or other machine specifications) for running its experiments.
Software Dependencies | No | The paper states "The code is available on GitHub." but does not list specific ancillary software components with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | "In all the experiments, we noticed that perturbing the extended MDP was not necessary to ensure convergence of SCOPT and so we set ηk = 0. We also set γk = 0 to speed-up the execution of SCOPT (see stopping condition in Fig. 3)."
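The experiment-setup row above refers to SCOPT, the paper's span-constrained planning routine, run without the perturbation (ηk = 0) and with the stopping-condition slack set to zero (γk = 0). As a rough illustration only, the core idea of span-constrained value iteration can be sketched as below; the function name `scopt_sketch`, the toy MDP, and the exact clipping and stopping rules are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def span(v):
    """Span semi-norm: max(v) - min(v)."""
    return np.max(v) - np.min(v)

def scopt_sketch(P, r, c, epsilon=1e-6, max_iter=10_000):
    """Hypothetical sketch of span-constrained value iteration.

    P: transition tensor of shape (S, A, S); r: rewards of shape (S, A);
    c: upper bound on the span of the value vector. Stops when the span
    of the successive difference falls below epsilon, mimicking a
    stopping condition with no extra slack (the gamma_k = 0 setting).
    No perturbation of the MDP is applied (the eta_k = 0 setting).
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        # Bellman optimality backup: q[s, a] = r[s, a] + sum_s' P[s, a, s'] * v[s']
        q = r + P @ v            # shape (S, A)
        tv = q.max(axis=1)       # greedy backup, shape (S,)
        # Span truncation: clip values exceeding min(tv) + c so span(tv) <= c
        tv = np.minimum(tv, tv.min() + c)
        if span(tv - v) < epsilon:
            return tv
        v = tv
    return v
```

On a small aperiodic two-state, two-action MDP, the returned vector satisfies the span constraint by construction, since every iterate is clipped to lie within `c` of its minimum entry.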