Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game

Authors: Ziyi Chen, Shaocong Ma, Yi Zhou

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical In this paper, we develop a fully decentralized and sample-efficient stochastic policy extragradient algorithm for solving tabular zero-sum Markov games. In particular, our algorithm utilizes multiple stochastic estimators to accurately estimate the value functions involved in the stochastic updates, and leverages entropy regularization to accelerate the convergence. Specifically, with a proper entropy-regularization parameter, we prove that the stochastic policy extragradient algorithm has a sample complexity of the order e O( Amax µminϵ5.5(1 γ)13.5 ) for finding a solution that achieves ϵ-Nash equilibrium duality gap, where Amax is the maximum number of actions between the players, µmin is the lower bound of state stationary distribution, and γ is the discount factor. Such a sample complexity result substantially improves the state-of-the-art complexity result.
Researcher Affiliation Academia Ziyi Chen, Shaocong Ma, Yi Zhou Department of ECE University of Utah Salt Lake City, UT 84112, USA {u1276972,yi.zhou,s.ma}@utah.edu
Pseudocode Yes Algorithm 1 Stochastic policy extragradient (SPE) for entropy-regularized Markov game
Open Source Code No The paper does not provide any concrete access information (e.g., links to a repository or explicit statements of code release) for the methodology described.
Open Datasets No The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset availability information is provided.
Dataset Splits No The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset split information for training, validation, or testing is provided.
Hardware Specification No The paper is theoretical and does not report experimental results. Therefore, no hardware specifications are mentioned.
Software Dependencies No The paper is theoretical and does not report experimental results. Therefore, no software dependencies with version numbers are mentioned.
Experiment Setup No The paper is theoretical and focuses on algorithmic design and convergence analysis rather than experimental implementation. No details about hyperparameter values, training configurations, or other system-level settings for an experimental setup are provided.