Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback

Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates to Nash equilibrium. We start from the case of stateless matrix games with bandit feedback as a warm-up, showing an O(t^{-1/8}) last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of O(t^{-1/(9+ε)}) for any ε > 0. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, a new notion of convergence we define, of O(t^{-1/10}). Our algorithm removes the coordination and prior-knowledge requirements of [WLZL21a], which pursued the same goals as we do for irreducible Markov games. Our algorithm is related to [CMZ21, CWC21] and also builds on the entropy regularization technique. However, we remove their requirement of communicating the entropy values, making our algorithm entirely uncoupled. (An illustrative sketch of such an entropy-regularized bandit update appears after this table.)
Researcher Affiliation | Academia | Yang Cai (Yale University, yang.cai@yale.edu); Haipeng Luo (University of Southern California, haipengl@usc.edu); Chen-Yu Wei (University of Virginia, chenyu.wei@virginia.edu); Weiqiang Zheng (Yale University, weiqiang.zheng@yale.edu)
Pseudocode | Yes | Algorithm 1 (Matrix Game with Bandit Feedback); Algorithm 2 (Irreducible Markov Game); Algorithm 3 (General Markov Game)
Open Source Code | No | The paper does not contain any statements about making source code publicly available, nor does it provide any links to a code repository.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments on specific datasets, and therefore has no public-dataset availability to report.
Dataset Splits | No | The paper is theoretical and focuses on algorithm design and convergence rates. It does not describe experiments with training/validation/test splits.
Hardware Specification | No | The paper is theoretical, focusing on algorithms and their convergence properties, and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and describes algorithms and proofs. It does not list specific software dependencies with version numbers for implementation or experimentation.
Experiment Setup | No | The paper defines parameters for its theoretical algorithms (e.g., the decaying schedules η_t = t^{-k_η}, β_t = t^{-k_β}, ε_t = t^{-k_ε} in Algorithm 1), but these are part of the algorithm's mathematical definition rather than an experimental setup for empirical evaluation. (A toy rendering of such schedules appears after this table.)
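As a companion to the Research Type row above, here is a minimal Python sketch of the kind of uncoupled, entropy-regularized learning dynamic with bandit feedback that the abstract describes for stateless matrix games. The payoff matrix, the exponent schedules, and the exact form of the regularized exponential-weights update are illustrative assumptions, not the paper's Algorithm 1 or its tuned constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3x3 zero-sum game: A[i, j] is the row player's loss (column player's gain).
A = rng.uniform(-1.0, 1.0, size=(3, 3))
n = A.shape[0]

x = np.ones(n) / n  # row player's policy
y = np.ones(n) / n  # column player's policy

for t in range(1, 50_001):
    # Polynomially decaying schedules; exponents are placeholders, not the paper's constants.
    eta = t ** -0.6   # learning rate
    beta = t ** -0.4  # entropy-regularization weight
    eps = t ** -0.3   # uniform-exploration weight, keeps probabilities bounded away from 0

    # Uncoupled play: each player samples from its own exploration-mixed policy and
    # observes only the realized payoff, never the opponent's policy or the matrix A.
    x_mix = (1 - eps) * x + eps / n
    y_mix = (1 - eps) * y + eps / n
    i = rng.choice(n, p=x_mix)
    j = rng.choice(n, p=y_mix)
    loss = A[i, j]  # row player's observed loss; the column player observes -loss

    # Importance-weighted loss estimates built from bandit feedback alone.
    lhat_x = np.zeros(n)
    lhat_x[i] = loss / x_mix[i]
    lhat_y = np.zeros(n)
    lhat_y[j] = -loss / y_mix[j]

    # Entropy-regularized exponential-weights update: the beta * log(.) term is the
    # gradient of the negative-entropy regularizer and pulls each policy toward uniform.
    x = x * np.exp(-eta * (lhat_x + beta * np.log(x)))
    x /= x.sum()
    y = y * np.exp(-eta * (lhat_y + beta * np.log(y)))
    y /= y.sum()

print("row policy:", np.round(x, 3))
print("col policy:", np.round(y, 3))
```

Decaying beta toward zero is the standard way to let the regularized equilibria drift toward an actual Nash equilibrium while the regularization stabilizes the last iterate; the paper's analysis is what pins down which exponent choices yield the quoted O(t^{-1/8}) rate.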
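Similarly, the parameter schedules quoted in the Experiment Setup row are just polynomially decaying functions of the round index. The helper below is a hypothetical rendering: k_eta, k_beta, and k_eps stand in for the paper's symbolic exponents, and the default values are placeholders chosen only so the snippet runs.

```python
def schedules(t: int, k_eta: float = 0.6, k_beta: float = 0.4, k_eps: float = 0.3):
    """Return (eta_t, beta_t, eps_t) = (t^-k_eta, t^-k_beta, t^-k_eps).

    Hypothetical helper: the exponent values are placeholders, not the
    constants chosen in the paper's Algorithm 1.
    """
    return t ** -k_eta, t ** -k_beta, t ** -k_eps

print(schedules(1))    # (1.0, 1.0, 1.0)
print(schedules(100))  # all three shrink polynomially with t
```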