Efficiently Solving MDPs with Stochastic Mirror Descent

Authors: Yujia Jin, Aaron Sidford

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with A_tot total actions and mixing time bound t_mix, our method computes an ϵ-optimal policy with an expected Õ(t_mix² A_tot ϵ⁻²) samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a γ-discounted MDP with A_tot total actions, our method computes an ϵ-optimal policy with an expected Õ((1−γ)⁻⁴ A_tot ϵ⁻²) samples, improving over previous primal-dual methods and matching the state-of-the-art up to a (1−γ)⁻¹ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains, and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs. (Abstract) and Table 1 gives a comparison of sample complexity between our methods and prior methods for computing an ϵ-approximate policy in DMDPs and AMDPs given a generative model. (Page 2)
Researcher Affiliation | Academia | Yujia Jin and Aaron Sidford, Management Science and Engineering (MS&E), Stanford University, Stanford, United States. Correspondence to: Yujia Jin <yujiajin@stanford.edu>. (Page 1)
Pseudocode | Yes | Algorithm 1 SMD for mixing AMDP / DMDPs (Page 3) and Algorithm 2 SMD for ℓ∞-ℓ₁ game (Page 4)
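The paper's core primitive is stochastic mirror descent on a bilinear saddle-point problem with a box domain for one player and a simplex domain for the other. The sketch below is an illustrative toy version of that setup, not the paper's Algorithm 1 or 2: the function name, step sizes, and sampling scheme are assumptions, and the gradient estimators are deliberately simplified (sample a row by the simplex variable's distribution, sample a column uniformly). It shows only the characteristic structure: a Euclidean projected step on the box variable and an entropic (multiplicative-weights) step on the simplex variable, with iterate averaging.

```python
import numpy as np

def smd_box_simplex(A, b, steps=20000, eta_x=0.01, eta_y=0.01, seed=0):
    """Toy stochastic mirror descent for the bilinear game
        min_{x in [0,1]^n} max_{y in simplex_m}  y^T (A x - b).
    Hypothetical sketch; step sizes and estimators are illustrative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.full(n, 0.5)          # box iterate
    y = np.full(m, 1.0 / m)      # simplex iterate
    x_sum = np.zeros(n)
    y_sum = np.zeros(m)
    for _ in range(steps):
        # Unbiased estimate of grad_x = A^T y: sample a row index i ~ y.
        i = rng.choice(m, p=y)
        g_x = A[i]
        # Unbiased estimate of grad_y = A x - b: sample a column j uniformly.
        j = rng.integers(n)
        g_y = n * A[:, j] * x[j] - b
        # Euclidean step + projection onto the box [0, 1]^n.
        x = np.clip(x - eta_x * g_x, 0.0, 1.0)
        # Entropic mirror step on the simplex (multiplicative weights).
        y = y * np.exp(eta_y * g_y)
        y /= y.sum()
        x_sum += x
        y_sum += y
    # Return averaged iterates, the standard output of saddle-point SMD.
    return x_sum / steps, y_sum / steps
```

The averaged pair approximates a saddle point; in the paper, such a pair is then rounded to an approximately optimal MDP policy, a step omitted here.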
Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper discusses theoretical sample complexity bounds for MDPs given a generative model, but does not refer to specific real-world datasets for training or provide access information for any dataset.
Dataset Splits | No | The paper does not describe any empirical experiments on datasets, and thus no training, validation, or test splits are mentioned.
Hardware Specification | No | The paper is theoretical and focuses on sample complexity analysis; it does not mention any specific hardware used for experiments or computations.
Software Dependencies | No | The paper is theoretical and does not provide details on specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any empirical experiments; therefore, it does not provide details on experimental setup or hyperparameter values.