Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game
Authors: Ziyi Chen, Shaocong Ma, Yi Zhou
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we develop a fully decentralized and sample-efficient stochastic policy extragradient algorithm for solving tabular zero-sum Markov games. In particular, our algorithm utilizes multiple stochastic estimators to accurately estimate the value functions involved in the stochastic updates, and leverages entropy regularization to accelerate the convergence. Specifically, with a proper entropy-regularization parameter, we prove that the stochastic policy extragradient algorithm has a sample complexity of the order e O( Amax µminϵ5.5(1 γ)13.5 ) for finding a solution that achieves ϵ-Nash equilibrium duality gap, where Amax is the maximum number of actions between the players, µmin is the lower bound of state stationary distribution, and γ is the discount factor. Such a sample complexity result substantially improves the state-of-the-art complexity result. |
| Researcher Affiliation | Academia | Ziyi Chen, Shaocong Ma, Yi Zhou Department of ECE University of Utah Salt Lake City, UT 84112, USA {u1276972,yi.zhou,s.ma}@utah.edu |
| Pseudocode | Yes | Algorithm 1 Stochastic policy extragradient (SPE) for entropy-regularized Markov game |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., links to a repository or explicit statements of code release) for the methodology described. |
| Open Datasets | No | The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset availability information is provided. |
| Dataset Splits | No | The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset split information for training, validation, or testing is provided. |
| Hardware Specification | No | The paper is theoretical and does not report experimental results. Therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not report experimental results. Therefore, no software dependencies with version numbers are mentioned. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithmic design and convergence analysis rather than experimental implementation. No details about hyperparameter values, training configurations, or other system-level settings for an experimental setup are provided. |