Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs
Authors: Andrea Tirinzoni, Aymen Al Marjani, Emilie Kaufmann
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we perform numerical simulations on random deterministic MDPs which reveal that EPRL can indeed improve over existing minimax-optimal algorithms tailored for the deterministic case. [...] We compare numerically EPRL to the minimax optimal BPI-UCRL algorithm [29], adapted to the deterministic setting, on synthetic MDP instances. [...] In Figure 1 we show the distribution of the sample complexity (estimated over 10 Monte Carlo simulations) |
| Researcher Affiliation | Collaboration | Andrea Tirinzoni Meta AI Paris, France Aymen Al-Marjani UMPA, ENS Lyon Lyon, France Emilie Kaufmann Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL Lille, France |
| Pseudocode | Yes | Algorithm 1 Elimination-based PAC RL (EPRL) for deterministic MDPs |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the appendix. |
| Open Datasets | No | We generate random easy deterministic MDP instances with Gaussian rewards of variance 1 using the following protocol. For fixed S, A, H the mean rewards rh(s, a) are drawn i.i.d. from a uniform distribution over [0, 1] and for each state-action pair, the next state is chosen uniformly at random in {1, . . . , S}. Finally, we only keep MDP instances whose minimum value gap, denoted by Δmin, is larger than 0.1. (The paper describes generating data, not using a publicly available dataset with concrete access information like a link, DOI, or formal citation.) |
| Dataset Splits | No | The paper describes generating MDP instances and running Monte-Carlo simulations but does not specify explicit training/validation/test splits for a dataset. |
| Hardware Specification | No | The ethics checklist states that hardware details are in the appendix, but the appendix itself (Appendix F) only discusses implementation details and a baseline, without specifying GPU models, CPU types, or other hardware specifications. |
| Software Dependencies | No | The paper mentions using Python for implementation and refers to the BPI-UCRL algorithm, but it does not provide specific version numbers for any programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We generate random easy deterministic MDP instances with Gaussian rewards of variance 1 using the following protocol. For fixed S, A, H the mean rewards rh(s, a) are drawn i.i.d. from a uniform distribution over [0, 1] and for each state-action pair, the next state is chosen uniformly at random in {1, . . . , S}. [...] For EPRL, we experiment with two sampling rules: max-coverage (max Cov) and max-diameter (max D, see Line 18 of Algorithm 1). [...] S, A = 2 and H = 3 and for algorithms that are run with parameters δ = 0.1 and ε = 1.5Δmin. [...] We ran 10 Monte-Carlo simulations for each value of the triplet (MDP, algorithm A, ε). |
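
Based on the generation protocol quoted in the Open Datasets and Experiment Setup rows above, the sketch below shows one way the random "easy" deterministic MDP instances could be reproduced. The function name, the NumPy-based implementation, and the backward-induction gap computation are assumptions for illustration; they are not the authors' released code.

```python
import numpy as np

def generate_easy_deterministic_mdp(S, A, H, gap_threshold=0.1, rng=None):
    """Sketch of the quoted protocol (assumed implementation, not the authors' code):
    mean rewards r_h(s, a) ~ Uniform[0, 1] (observed rewards are Gaussian with
    variance 1), deterministic next states drawn uniformly over {1, ..., S},
    and only instances with minimum value gap Δmin > gap_threshold are kept."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        rewards = rng.uniform(0.0, 1.0, size=(H, S, A))   # mean rewards r_h(s, a)
        next_state = rng.integers(0, S, size=(H, S, A))    # deterministic transitions

        # Backward induction for the optimal Q-values Q_h(s, a).
        Q = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            V_next = Q[h + 1].max(axis=1)                  # V_{h+1}(s) = max_a Q_{h+1}(s, a)
            Q[h] = rewards[h] + V_next[next_state[h]]

        # Minimum positive value gap: Δmin = min over (h, s, a) of V_h(s) - Q_h(s, a), gap > 0.
        gaps = Q[:H].max(axis=2, keepdims=True) - Q[:H]
        positive_gaps = gaps[gaps > 0]
        if positive_gaps.size and positive_gaps.min() > gap_threshold:
            return rewards, next_state, positive_gaps.min()

# Example in the quoted setting S = 2, A = 2, H = 3; ε would then be set to 1.5 * delta_min.
rewards, next_state, delta_min = generate_easy_deterministic_mdp(S=2, A=2, H=3)
```

Such a sketch would then be paired with the quoted run parameters (δ = 0.1, ε = 1.5Δmin, 10 Monte-Carlo simulations per (MDP, algorithm, ε) triplet), but the exact seeding and rejection-sampling details are not specified in the paper excerpts above.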