Finding good policies in average-reward Markov Decision Processes without prior knowledge
Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In the generative model setting, a first hope to get an algorithm agnostic to H is to plug-in a tight upper bound on this quantity in the algorithm of [27]. Our first contribution is a negative result: the number of samples necessary to estimate H within a prescribed accuracy is not polynomial in H, S and A. This result is proved in Section 3. On the positive side, by combining a procedure for estimating the diameter D inspired by [21] with the algorithm of [27] we propose in Section 4 Diameter Free Exploration (DFE), an algorithm for communicating MDPs that does not require any prior knowledge on the MDP and has a near-optimal e O (SAD/ε2) log(1/δ) sample complexity in the asymptotic regime of small ε. Then in Section 5 we discuss the hardness of (ε, δ)-PAC policy identification in the online setting. We notably prove a lower bound showing that the sample complexity of (ε, δ)-PAC policy identification cannot always be polynomial in S, A and H, even with the knowledge of H. On the algorithmic side, we propose in Section 6 an online variant of DFE whose sample complexity scales in SAD2/ε2, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound. |
| Researcher Affiliation | Academia | Adrienne Tuynman adrienne.tuynman@inria.fr Rémy Degenne remy.degenne@inria.fr Emilie Kaufmann emilie.kaufmann@univ-lille.fr Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189-CRISt AL, F-59000 Lille, France |
| Pseudocode | Yes | Algorithm 1: Diameter Free Exploration (DFE), Algorithm 2: A diameter estimation procedure, Algorithm 3: Algorithm 2 from [27], Algorithm 4: Uniform sampling combined with adaptive stopping |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is theoretical and does not describe or use any specific dataset for its own research, thus no information about public availability is provided. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, thus no information about training, validation, or test splits is provided. |
| Hardware Specification | No | The paper is theoretical and does not involve running experiments, therefore no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not involve running experiments that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not involve running experiments, therefore no experimental setup details like hyperparameter values or training configurations are provided. |