Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation
Authors: Yue Guan, Qifan Zhang, Panagiotis Tsiotras
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results applied to a number of stochastic games verify that the proposed algorithm converges to a Nash equilibrium, while exhibiting a major speed-up over existing algorithms. |
| Researcher Affiliation | Academia | Yue Guan * , Qifan Zhang * and Panagiotis Tsiotras Georgia Institute of Technology {yguan44, qzhang410, tsiotras}@gatech.edu |
| Pseudocode | Yes | Algorithm 1: Baseline SNQ2 in the cMDP Settings; Algorithm 2: SNQ2-Learning Algorithm; Algorithm 3: Dynamic Schedule for M and β |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | To evaluate the performance of the proposed algorithm, we tested and compared SNQ2L with four existing algorithms (Minimax-Q [Littman, 1994], Two-agent Soft-Q [Grau-Moya et al., 2018], WoLF-PHC [Bowling and Veloso, 2001], Single-Q [Tan, 1993]) for three zero-sum game environments: a Soccer game as in [Littman, 1994], a two-agent Pursuit-Evasion Game (PEG) [Guan et al., 2020] and a sequential Rock-Paper-Scissor game (sRPS). |
| Dataset Splits | No | The paper describes the game environments and evaluation criteria, but it does not specify how the data (or game episodes) were split into training, validation, and test sets. It does not use these terms in the context of data partitioning for reproduction. |
| Hardware Specification | Yes | Implemented in a Python environment with an AMD Ryzen 1920X. Matrix games at each state are solved via Scipy's linprog. |
| Software Dependencies | No | The paper mentions 'Python environment' and 'Scipy' but does not provide specific version numbers for these or any other software libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | The default number of episodes between two Nash prior policy updates M0 and the default decay rate of the inverse temperature λ ∈ (0, 1) are given initially as M0 = (N_states · N_action-pairs) / (α0 · Tmax) and λ = (β_end / β0)^(1/N_updates), where α0 is the initial learning rate and Tmax is the maximum length of a learning episode; β0 and β_end are the initial and estimated final magnitudes for both β_op and β_pl, and N_updates is the estimated number of prior updates. This value of M0 allows the algorithm to properly explore the state-action pairs so that the first prior update is performed with an informed Q-function. In our numerical experiments we found that β0 = 20, β_end = 0.1 and N_updates = 10 are a good set of values. |
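As a quick numerical illustration of the schedule quoted in the Experiment Setup row, the Python sketch below computes the default M0 and λ. It is not code from the paper: the function name, the example environment sizes, and the reading of M0 as the ratio of state-action pairs to α0 · Tmax (and of λ as (β_end / β0)^(1/N_updates)) are assumptions made for demonstration.

```python
# Illustrative sketch (not from the paper): default SNQ2 hyperparameter
# schedule as described in the reported experiment setup.

def default_schedule(n_states, n_action_pairs, alpha_0, t_max,
                     beta_0=20.0, beta_end=0.1, n_updates=10):
    """Return (M0, lambda) for the dynamic schedule.

    M0     : default number of episodes between two Nash prior policy
             updates, sized so state-action pairs are explored before
             the first prior update (assumed to be a simple ratio here).
    lam    : multiplicative decay rate for the inverse temperatures
             beta_op and beta_pl, chosen so beta shrinks from beta_0
             to roughly beta_end over n_updates prior updates.
    """
    m_0 = (n_states * n_action_pairs) / (alpha_0 * t_max)
    lam = (beta_end / beta_0) ** (1.0 / n_updates)
    return m_0, lam


if __name__ == "__main__":
    # Hypothetical environment sizes, e.g. a small grid-world pursuit game.
    m_0, lam = default_schedule(n_states=800, n_action_pairs=25,
                                alpha_0=1.0, t_max=50)
    print(f"M0 = {m_0:.0f} episodes, lambda = {lam:.3f}")
```

With the reported β0 = 20, β_end = 0.1 and N_updates = 10, this decay rate works out to roughly λ ≈ 0.59.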