Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs

Authors: Andrea Tirinzoni, Aymen Al Marjani, Emilie Kaufmann

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we perform numerical simulations on random deterministic MDPs which reveal that EPRL can indeed improve over existing minimax-optimal algorithms tailored for the deterministic case. [...] We compare numerically EPRL to the minimax optimal BPI-UCRL algorithm [29], adapted to the deterministic setting, on synthetic MDP instances. [...] In Figure 1 we show the distribution of the sample complexity (estimated over 10 Monte Carlo simulations)
Researcher Affiliation | Collaboration | Andrea Tirinzoni, Meta AI, Paris, France; Aymen Al-Marjani, UMPA, ENS Lyon, Lyon, France; Emilie Kaufmann, Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, Lille, France
Pseudocode | Yes | Algorithm 1: Elimination-based PAC RL (EPRL) for deterministic MDPs
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the appendix.
Open Datasets | No | We generate random easy deterministic MDP instances with Gaussian rewards of variance 1 using the following protocol. For fixed S, A, H the mean rewards r_h(s, a) are drawn i.i.d. from a uniform distribution over [0, 1] and for each state-action pair, the next state is chosen uniformly at random in {1, . . . , S}. Finally, we only keep MDP instances whose minimum value gap, denoted by Δmin, is larger than 0.1. (The paper describes generating data, not using a publicly available dataset with concrete access information like a link, DOI, or formal citation. A sketch of this generation protocol is given below.)
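The quoted generation protocol is simple enough to sketch in code. The snippet below is an illustrative reconstruction only; the function name, array layout, and 0-indexing of states are our assumptions, not taken from the paper.

```python
# Illustrative sketch of the quoted instance-generation protocol (names and
# array layout are assumptions, not from the paper).
import numpy as np

def sample_deterministic_mdp(S, A, H, seed=None):
    """Mean rewards r_h(s, a) ~ Uniform[0, 1] i.i.d.; for each (h, s, a) the
    successor state is drawn uniformly at random from the S states (0-indexed
    here). Observed rewards would then be Gaussian with these means and
    variance 1, as stated in the quote."""
    rng = np.random.default_rng(seed)
    mean_rewards = rng.uniform(0.0, 1.0, size=(H, S, A))   # r_h(s, a)
    next_state = rng.integers(0, S, size=(H, S, A))        # deterministic transitions
    return mean_rewards, next_state
```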
Dataset Splits | No | The paper describes generating MDP instances and running Monte-Carlo simulations but does not specify explicit training/validation/test splits for a dataset.
Hardware Specification | No | The ethics checklist states that hardware details are in the appendix, but the appendix itself (Appendix F) only discusses implementation details and a baseline, without specifying GPU models, CPU types, or other hardware specifications.
Software Dependencies | No | The paper mentions using Python for implementation and refers to the BPI-UCRL algorithm, but it does not provide specific version numbers for any programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | We generate random easy deterministic MDP instances with Gaussian rewards of variance 1 using the following protocol. For fixed S, A, H the mean rewards r_h(s, a) are drawn i.i.d. from a uniform distribution over [0, 1] and for each state-action pair, the next state is chosen uniformly at random in {1, . . . , S}. [...] For EPRL, we experiment with two sampling rules: max-coverage (max Cov) and max-diameter (max D, see Line 18 of Algorithm 1). [...] S, A = 2 and H = 3 and for algorithms that are run with parameters δ = 0.1 and ε = 1.5 Δmin. [...] We ran 10 Monte-Carlo simulations for each value of the triplet (MDP, algorithm A, ε).
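To make the "keep only instances with Δmin > 0.1" filter and the ε = 1.5 Δmin setting concrete, here is one possible way to compute the minimum positive value gap of such an instance by backward induction. The helper names and the rejection-sampling loop are ours (reusing sample_deterministic_mdp from the earlier sketch); the gap definition Δ_h(s, a) = V*_h(s) − Q*_h(s, a) is the standard one and is assumed rather than copied from the paper.

```python
# Hypothetical helpers (ours, not from the paper): compute Delta_min of a
# deterministic finite-horizon MDP by backward induction, reject instances
# with Delta_min <= 0.1, and report Delta_min so that epsilon = 1.5 * Delta_min.
import numpy as np

def min_value_gap(mean_rewards, next_state):
    """Arrays of shape (H, S, A), as returned by sample_deterministic_mdp."""
    H, S, A = mean_rewards.shape
    V = np.zeros(S)                                  # V*_{H+1} = 0
    positive_gaps = []
    for h in reversed(range(H)):
        Q = mean_rewards[h] + V[next_state[h]]       # Q*_h(s, a) = r_h(s, a) + V*_{h+1}(s')
        V = Q.max(axis=1)                            # V*_h(s)
        gaps = V[:, None] - Q                        # Delta_h(s, a)
        positive_gaps.append(gaps[gaps > 0])
    all_gaps = np.concatenate(positive_gaps)
    return all_gaps.min() if all_gaps.size else np.inf

def sample_easy_instance(S=2, A=2, H=3, gap_threshold=0.1, seed=None):
    """Rejection sampling matching the quoted protocol (names illustrative)."""
    rng = np.random.default_rng(seed)
    while True:
        mdp = sample_deterministic_mdp(S, A, H, seed=int(rng.integers(1 << 31)))
        delta_min = min_value_gap(*mdp)
        if delta_min > gap_threshold:
            return mdp, delta_min   # e.g. run with delta = 0.1, epsilon = 1.5 * delta_min
```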