Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback
Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with nonasymptotic convergence rates to Nash equilibrium. We start from the case of a stateless matrix game with bandit feedback as a warm-up, showing an O(t^{-1/8}) last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of O(t^{-1/(9+ε)}) for any ε > 0. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, a new notion of convergence we define, of O(t^{-1/10}). Our algorithm removes the coordination and prior-knowledge requirements of [WLZL21a], which pursued the same goals as us for irreducible Markov games. Our algorithm is related to [CMZ21, CWC21] and also builds on the entropy regularization technique. However, we remove their requirement of communicating entropy values, making our algorithm entirely uncoupled. |
| Researcher Affiliation | Academia | Yang Cai (Yale University, yang.cai@yale.edu); Haipeng Luo (University of Southern California, haipengl@usc.edu); Chen-Yu Wei (University of Virginia, chenyu.wei@virginia.edu); Weiqiang Zheng (Yale University, weiqiang.zheng@yale.edu) |
| Pseudocode | Yes | Algorithm 1: Matrix Game with Bandit Feedback; Algorithm 2: Irreducible Markov Game; Algorithm 3: General Markov Game |
| Open Source Code | No | The paper does not contain any statements about making source code publicly available, nor does it provide any links to a code repository. |
| Open Datasets | No | This paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments run on specific datasets, and therefore does not specify public dataset availability in that context. |
| Dataset Splits | No | This paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments with training/validation/test splits. |
| Hardware Specification | No | The paper is theoretical, focusing on algorithms and their convergence properties, and does not mention any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and describes algorithms and proofs. It does not list specific software dependencies with version numbers for implementation or experimentation. |
| Experiment Setup | No | The paper defines parameters for its theoretical algorithms (e.g., polynomially decaying schedules η_t = t^{-k_η}, β_t = t^{-k_β}, ε_t = t^{-k_ε} in Algorithm 1), which are part of the algorithm's mathematical definition rather than an experimental setup for empirical evaluation. |
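Since the paper provides no code, the flavor of its setting can be illustrated with a minimal self-play sketch: two *uncoupled* learners in a zero-sum matrix game, each running an entropy-regularized (EXP3-style) update from bandit feedback alone. This is an illustrative stand-in, not the paper's Algorithm 1; the schedules `eta_t = t**-0.5` and `eps_t = 0.5 * t**-0.25` are assumed here for demonstration and are not the schedules the paper analyzes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run_selfplay(A, T=20000, seed=0):
    """Uncoupled bandit learning in the zero-sum game with payoff matrix A.

    Each player observes only its own realized payoff A[i, j] -- no
    communication and no view of the opponent's policy. Sketch only:
    the step-size/exploration schedules below are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    Lx = np.zeros(n)         # importance-weighted loss estimates, player 1
    Ly = np.zeros(m)         # same for player 2
    xbar, ybar = np.zeros(n), np.zeros(m)
    for t in range(1, T + 1):
        eta, eps = t ** -0.5, 0.5 * t ** -0.25
        # entropy-regularized policy = softmax of losses, plus exploration
        x = (1 - eps) * softmax(-eta * Lx) + eps / n
        y = (1 - eps) * softmax(-eta * Ly) + eps / m
        i = rng.choice(n, p=x)
        j = rng.choice(m, p=y)
        # player 1 maximizes A[i, j]; rescale payoffs in [-1, 1] to losses in [0, 1]
        loss1 = (1 - A[i, j]) / 2
        loss2 = (1 + A[i, j]) / 2
        Lx[i] += loss1 / x[i]   # bandit feedback: only the played arm updates
        Ly[j] += loss2 / y[j]
        xbar += x
        ybar += y
    return xbar / T, ybar / T

# Matching pennies: the unique Nash equilibrium is uniform play for both.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
xbar, ybar = run_selfplay(A)
```

Because both players are no-regret, the *time-averaged* policies approach the equilibrium (here, near (0.5, 0.5)); the paper's contribution is the much harder last-iterate and path-convergence guarantees under the same uncoupled bandit feedback.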