Optimal Sample Complexity for Average Reward Markov Decision Processes
Authors: Shengbo Wang, Jose Blanchet, Peter Glynn
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we conduct numerical experiments to validate our theoretical findings. |
| Researcher Affiliation | Academia | Shengbo Wang, Jose Blanchet & Peter Glynn Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA {shengbo.wang,jose.blanchet,glynn}@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Perturbed Model-based Planning (Li et al., 2020): PMBP(γ, ζ, n)... Algorithm 2 Reduction and Perturbed Model-based Planning |
| Open Source Code | No | The paper does not provide an unambiguous statement of releasing its source code, nor does it include a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | The family of reward functions and transition kernels used for both experiments belongs to the family of hard instances constructed in Wang et al. (2023). |
| Dataset Splits | No | The paper mentions running numerical experiments with a family of hard instances from Wang et al. (2023) and conducting 300 replications, but does not specify explicit train, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any key software components, libraries, or solvers used in the research. |
| Experiment Setup | Yes | Input: Discount factor γ (0, 1). Perturbation amplitude ζ > 0. Sample size n 1. (Algorithm 1) ... Input: Error tolerance ϵ (0, 1]. Assign γ = 1 ϵ 19tminorize , ζ = 1 4(1 γ)tminorize, and n = cβδ(η δ) (1 γ)2tminorize where c = 4 4862. (Algorithm 2) ... The experiments in Figure 1b use C = 4500 for the purple line and C = 18000 for the blue line. |