Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation
Authors: Yue Guan, Qifan Zhang, Panagiotis Tsiotras
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results applied to a number of stochastic games verify that the proposed algorithm converges to a Nash equilibrium, while exhibiting a major speed-up over existing algorithms. |
| Researcher Affiliation | Academia | Yue Guan * , Qifan Zhang * and Panagiotis Tsiotras Georgia Institute of Technology {yguan44, qzhang410, tsiotras}@gatech.edu |
| Pseudocode | Yes | Algorithm 1: Baseline SNQ2 in the cMDP Settings; Algorithm 2: SNQ2-Learning Algorithm; Algorithm 3: Dynamic Schedule for M and β |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | To evaluate the performance of the proposed algorithm, we tested and compared SNQ2L with four existing algorithms (Minimax-Q [Littman, 1994], Two-agent Soft-Q [Grau-Moya et al., 2018], WoLF-PHC [Bowling and Veloso, 2001], Single-Q [Tan, 1993]) for three zero-sum game environments: a Soccer game as in [Littman, 1994], a two-agent Pursuit-Evasion Game (PEG) [Guan et al., 2020] and a sequential Rock-Paper-Scissor game (sRPS). |
| Dataset Splits | No | The paper describes the game environments and evaluation criteria, but it does not specify how the data (or game episodes) were split into training, validation, and test sets. It does not use these terms in the context of data partitioning for reproduction. |
| Hardware Specification | Yes | Implemented in a Python environment with an AMD Ryzen 1920X. Matrix games at each state are solved via Scipy's linprog. |
| Software Dependencies | No | The paper mentions 'Python environment' and 'Scipy' but does not provide specific version numbers for these or any other software libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | The default number of episodes between two Nash prior policy updates M0 and the default decay rate of the inverse temperature λ ∈ (0, 1) are given initially as M0 = (N_states · N_action-pairs) / (α0 · Tmax) and λ = (β_end / β0)^(1/N_updates), where α0 is the initial learning rate and Tmax is the maximum length of a learning episode; β0 and β_end are the initial and estimated final magnitudes for both β_op and β_pl, and N_updates is the estimated number of prior updates. This value of M0 allows the algorithm to properly explore the state-action pairs so that the first prior update is performed with an informed Q-function. In our numerical experiments we found that β0 = 20, β_end = 0.1 and N_updates = 10 are a good set of values. |
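As a quick numerical illustration of the schedule quoted in the Experiment Setup row, the Python sketch below computes the default M0 and λ. It is not code from the paper: the function name, the example environment sizes, and the reading of M0 as the ratio of state-action pairs to α0 · Tmax (and of λ as (β_end / β0)^(1/N_updates)) are assumptions made for demonstration.

```python
# Illustrative sketch (not from the paper): default SNQ2 hyperparameter
# schedule as described in the reported experiment setup.

def default_schedule(n_states, n_action_pairs, alpha_0, t_max,
                     beta_0=20.0, beta_end=0.1, n_updates=10):
    """Return (M0, lambda) for the dynamic schedule.

    M0     : default number of episodes between two Nash prior policy
             updates, sized so state-action pairs are explored before
             the first prior update (assumed to be a simple ratio here).
    lam    : multiplicative decay rate for the inverse temperatures
             beta_op and beta_pl, chosen so beta shrinks from beta_0
             to roughly beta_end over n_updates prior updates.
    """
    m_0 = (n_states * n_action_pairs) / (alpha_0 * t_max)
    lam = (beta_end / beta_0) ** (1.0 / n_updates)
    return m_0, lam


if __name__ == "__main__":
    # Hypothetical environment sizes, e.g. a small grid-world pursuit game.
    m_0, lam = default_schedule(n_states=800, n_action_pairs=25,
                                alpha_0=1.0, t_max=50)
    print(f"M0 = {m_0:.0f} episodes, lambda = {lam:.3f}")
```

With the reported β0 = 20, β_end = 0.1 and N_updates = 10, this decay rate works out to roughly λ ≈ 0.59.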