reproducibilityindex.ai

Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game

Authors: Ziyi Chen, Shaocong Ma, Yi Zhou

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper, we develop a fully decentralized and sample-efﬁcient stochastic policy extragradient algorithm for solving tabular zero-sum Markov games. In particular, our algorithm utilizes multiple stochastic estimators to accurately estimate the value functions involved in the stochastic updates, and leverages entropy regularization to accelerate the convergence. Speciﬁcally, with a proper entropy-regularization parameter, we prove that the stochastic policy extragradient algorithm has a sample complexity of the order e O( Amax µminϵ5.5(1 γ)13.5 ) for ﬁnding a solution that achieves ϵ-Nash equilibrium duality gap, where Amax is the maximum number of actions between the players, µmin is the lower bound of state stationary distribution, and γ is the discount factor. Such a sample complexity result substantially improves the state-of-the-art complexity result.
Researcher Affiliation	Academia	Ziyi Chen, Shaocong Ma, Yi Zhou Department of ECE University of Utah Salt Lake City, UT 84112, USA {u1276972,yi.zhou,s.ma}@utah.edu
Pseudocode	Yes	Algorithm 1 Stochastic policy extragradient (SPE) for entropy-regularized Markov game
Open Source Code	No	The paper does not provide any concrete access information (e.g., links to a repository or explicit statements of code release) for the methodology described.
Open Datasets	No	The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset availability information is provided.
Dataset Splits	No	The paper is theoretical and does not involve experimental evaluation on datasets, thus no dataset split information for training, validation, or testing is provided.
Hardware Specification	No	The paper is theoretical and does not report experimental results. Therefore, no hardware specifications are mentioned.
Software Dependencies	No	The paper is theoretical and does not report experimental results. Therefore, no software dependencies with version numbers are mentioned.
Experiment Setup	No	The paper is theoretical and focuses on algorithmic design and convergence analysis rather than experimental implementation. No details about hyperparameter values, training configurations, or other system-level settings for an experimental setup are provided.