Finding good policies in average-reward Markov Decision Processes without prior knowledge

Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical In the generative model setting, a first hope to get an algorithm agnostic to H is to plug-in a tight upper bound on this quantity in the algorithm of [27]. Our first contribution is a negative result: the number of samples necessary to estimate H within a prescribed accuracy is not polynomial in H, S and A. This result is proved in Section 3. On the positive side, by combining a procedure for estimating the diameter D inspired by [21] with the algorithm of [27] we propose in Section 4 Diameter Free Exploration (DFE), an algorithm for communicating MDPs that does not require any prior knowledge on the MDP and has a near-optimal e O (SAD/ε2) log(1/δ) sample complexity in the asymptotic regime of small ε. Then in Section 5 we discuss the hardness of (ε, δ)-PAC policy identification in the online setting. We notably prove a lower bound showing that the sample complexity of (ε, δ)-PAC policy identification cannot always be polynomial in S, A and H, even with the knowledge of H. On the algorithmic side, we propose in Section 6 an online variant of DFE whose sample complexity scales in SAD2/ε2, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
Researcher Affiliation Academia Adrienne Tuynman adrienne.tuynman@inria.fr Rémy Degenne remy.degenne@inria.fr Emilie Kaufmann emilie.kaufmann@univ-lille.fr Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189-CRISt AL, F-59000 Lille, France
Pseudocode Yes Algorithm 1: Diameter Free Exploration (DFE), Algorithm 2: A diameter estimation procedure, Algorithm 3: Algorithm 2 from [27], Algorithm 4: Uniform sampling combined with adaptive stopping
Open Source Code No The paper does not provide any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets No The paper is theoretical and does not describe or use any specific dataset for its own research, thus no information about public availability is provided.
Dataset Splits No The paper is theoretical and does not involve empirical evaluation on datasets, thus no information about training, validation, or test splits is provided.
Hardware Specification No The paper is theoretical and does not involve running experiments, therefore no hardware specifications are mentioned.
Software Dependencies No The paper is theoretical and does not involve running experiments that would require specific software dependencies with version numbers.
Experiment Setup No The paper is theoretical and does not involve running experiments, therefore no experimental setup details like hyperparameter values or training configurations are provided.