Optimal Sample Complexity for Average Reward Markov Decision Processes

Authors: Shengbo Wang, Jose Blanchet, Peter Glynn

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we conduct numerical experiments to validate our theoretical findings.
Researcher Affiliation | Academia | Shengbo Wang, Jose Blanchet & Peter Glynn, Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA, {shengbo.wang,jose.blanchet,glynn}@stanford.edu
Pseudocode | Yes | Algorithm 1 Perturbed Model-based Planning (Li et al., 2020): PMBP(γ, ζ, n) ... Algorithm 2 Reduction and Perturbed Model-based Planning
Open Source Code | No | The paper does not provide an unambiguous statement of releasing its source code, nor does it include a direct link to a code repository for the methodology described.
Open Datasets | Yes | The family of reward functions and transition kernels used for both experiments belongs to the family of hard instances constructed in Wang et al. (2023).
Dataset Splits | No | The paper mentions running numerical experiments with a family of hard instances from Wang et al. (2023) and conducting 300 replications, but does not specify explicit train, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not specify version numbers for any key software components, libraries, or solvers used in the research.
Experiment Setup | Yes | Input: Discount factor γ ∈ (0, 1). Perturbation amplitude ζ > 0. Sample size n ≥ 1. (Algorithm 1) ... Input: Error tolerance ϵ ∈ (0, 1]. Assign γ = 1 − ϵ/(19 t_minorize), ζ = (1/4)(1 − γ) t_minorize, and n = c β_δ(η ∧ δ)/((1 − γ)² t_minorize), where c = 4 · 486². (Algorithm 2) ... The experiments in Figure 1b use C = 4500 for the purple line and C = 18000 for the blue line. (A hedged code sketch of this setup follows the table.)
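The Pseudocode and Experiment Setup rows above quote Algorithm 1 (PMBP) and the parameter assignments of Algorithm 2. The sketch below is a minimal illustration of how those pieces could fit together, not the authors' implementation (no code is released, per the Open Source Code row): the `reduction_parameters` and `pmbp` helper names, the generative-model interface `sample(s, a, size)`, the log factor `beta`, the default constant `c`, and the use of plain value iteration with uniform reward perturbation are all assumptions made for illustration, loosely following the perturbed model-based planning idea of Li et al. (2020).

```python
import numpy as np

def reduction_parameters(eps, t_minorize, beta=1.0, c=4 * 486**2):
    """Map an average-reward error tolerance eps to the discounted-MDP
    parameters of Algorithm 2, following the assignments quoted in the
    Experiment Setup row (the default constant c and the log factor beta
    are assumptions in this sketch)."""
    gamma = 1.0 - eps / (19.0 * t_minorize)               # discount factor
    zeta = 0.25 * (1.0 - gamma) * t_minorize              # reward-perturbation amplitude
    n = int(np.ceil(c * beta / ((1.0 - gamma) ** 2 * t_minorize)))  # samples per (s, a)
    return gamma, zeta, n

def pmbp(sample, reward, n_states, n_actions, gamma, zeta, n, rng, iters=2000):
    """Minimal sketch of Perturbed Model-based Planning (PMBP): build an
    empirical transition model from n generative samples per (s, a), perturb
    the rewards with i.i.d. Uniform[0, zeta] noise, and run value iteration
    on the perturbed empirical MDP. `sample(s, a, size)` is an assumed
    generative-model interface returning next-state indices."""
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            next_states = sample(s, a, n)                            # n i.i.d. draws
            P_hat[s, a] = np.bincount(next_states, minlength=n_states) / n
    r_pert = reward + rng.uniform(0.0, zeta, size=reward.shape)      # perturbed rewards
    V = np.zeros(n_states)
    for _ in range(iters):                                           # value iteration
        Q = r_pert + gamma * (P_hat @ V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                                          # greedy policy
```

For instance, with ϵ = 0.1 and t_minorize = 5, `reduction_parameters` returns γ ≈ 0.99895, after which `pmbp` can be called on any simulator exposing the assumed generative-model interface.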