Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback

Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates to Nash equilibrium. We start from the case of stateless matrix games with bandit feedback as a warm-up, showing an O(t^{-1/8}) last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of O(t^{-1/(9+ε)}) for any ε > 0. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, a new notion of convergence we define, of O(t^{-1/10}). Our algorithm removes the coordination and prior-knowledge requirements of [WLZL21a], which pursued the same goals as we do for irreducible Markov games. Our algorithm is related to [CMZ21, CWC21] and also builds on the entropy regularization technique. However, we remove their requirement of communicating the entropy values, making our algorithm entirely uncoupled. (An illustrative sketch of such an entropy-regularized bandit update appears after this table.)
Researcher Affiliation | Academia | Yang Cai (Yale University, yang.cai@yale.edu); Haipeng Luo (University of Southern California, haipengl@usc.edu); Chen-Yu Wei (University of Virginia, chenyu.wei@virginia.edu); Weiqiang Zheng (Yale University, weiqiang.zheng@yale.edu)
Pseudocode | Yes | Algorithm 1 (Matrix Game with Bandit Feedback); Algorithm 2 (Irreducible Markov Game); Algorithm 3 (General Markov Game)
Open Source Code | No | The paper does not contain any statements about making source code publicly available, nor does it provide any links to a code repository.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments on specific datasets, and therefore has no public-dataset availability to report.
Dataset Splits | No | The paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments with training/validation/test splits.
Hardware Specification | No | The paper is theoretical, focusing on algorithms and their convergence properties, and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and describes algorithms and proofs. It does not list specific software dependencies with version numbers for implementation or experimentation.
Experiment Setup | No | The paper defines parameters for its theoretical algorithms (e.g., the decaying schedules η_t = t^{-k_η}, β_t = t^{-k_β}, ε_t = t^{-k_ε} in Algorithm 1), but these are part of the algorithm's mathematical definition rather than an experimental setup for empirical evaluation. (A toy rendering of such schedules appears after this table.)
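As a companion to the Research Type row above, here is a minimal Python sketch of the kind of uncoupled, entropy-regularized learning dynamic with bandit feedback that the abstract describes for stateless matrix games. The payoff matrix, the exponent schedules, and the exact form of the regularized exponential-weights update are illustrative assumptions, not the paper's Algorithm 1 or its tuned constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3x3 zero-sum game: A[i, j] is the row player's loss (column player's gain).
A = rng.uniform(-1.0, 1.0, size=(3, 3))
n = A.shape[0]

x = np.ones(n) / n  # row player's policy
y = np.ones(n) / n  # column player's policy

for t in range(1, 50_001):
    # Polynomially decaying schedules; exponents are placeholders, not the paper's constants.
    eta = t ** -0.6   # learning rate
    beta = t ** -0.4  # entropy-regularization weight
    eps = t ** -0.3   # uniform-exploration weight, keeps probabilities bounded away from 0

    # Uncoupled play: each player samples from its own exploration-mixed policy and
    # observes only the realized payoff, never the opponent's policy or the matrix A.
    x_mix = (1 - eps) * x + eps / n
    y_mix = (1 - eps) * y + eps / n
    i = rng.choice(n, p=x_mix)
    j = rng.choice(n, p=y_mix)
    loss = A[i, j]  # row player's observed loss; the column player observes -loss

    # Importance-weighted loss estimates built from bandit feedback alone.
    lhat_x = np.zeros(n)
    lhat_x[i] = loss / x_mix[i]
    lhat_y = np.zeros(n)
    lhat_y[j] = -loss / y_mix[j]

    # Entropy-regularized exponential-weights update: the beta * log(.) term is the
    # gradient of the negative-entropy regularizer and pulls each policy toward uniform.
    x = x * np.exp(-eta * (lhat_x + beta * np.log(x)))
    x /= x.sum()
    y = y * np.exp(-eta * (lhat_y + beta * np.log(y)))
    y /= y.sum()

print("row policy:", np.round(x, 3))
print("col policy:", np.round(y, 3))
```

Decaying beta toward zero is the standard way to let the regularized equilibria drift toward an actual Nash equilibrium while the regularization stabilizes the last iterate; the paper's analysis is what pins down which exponent choices yield the quoted O(t^{-1/8}) rate.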
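Similarly, the parameter schedules quoted in the Experiment Setup row are just polynomially decaying functions of the round index. The helper below is a hypothetical rendering: k_eta, k_beta, and k_eps stand in for the paper's symbolic exponents, and the default values are placeholders chosen only so the snippet runs.

```python
def schedules(t: int, k_eta: float = 0.6, k_beta: float = 0.4, k_eps: float = 0.3):
    """Return (eta_t, beta_t, eps_t) = (t^-k_eta, t^-k_beta, t^-k_eps).

    Hypothetical helper: the exponent values are placeholders, not the
    constants chosen in the paper's Algorithm 1.
    """
    return t ** -k_eta, t ** -k_beta, t ** -k_eps

print(schedules(1))    # (1.0, 1.0, 1.0)
print(schedules(100))  # all three shrink polynomially with t
```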