A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games

Authors: Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Tong Zhang

ICML 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | Theoretical analysis demonstrates that the posterior sampling algorithm admits a √T-regret bound for problems with a low multi-agent decoupling coefficient, a new complexity measure for MGs, where T denotes the number of episodes. When specialized to linear MGs, the obtained regret bound matches the state-of-the-art results. To the best of our knowledge, this is the first provably efficient posterior sampling algorithm for MGs with frequentist regret guarantees, which enriches the toolbox for MGs and promotes the broad applicability of posterior sampling. |
| Researcher Affiliation | Collaboration | (1) The Hong Kong University of Science and Technology; (2) Center for Data Science, Peking University; (3) University of Virginia; (4) Google Research. |
| Pseudocode | Yes | Algorithm 1: Conditional Posterior Sampling with Booster; Algorithm 2: Main(F, η, D, T, λ); Algorithm 3: Booster(F, η, D, µf, T, λ) |
| Open Source Code | No | The paper provides no statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments on specific datasets, so no publicly available training datasets are reported. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments, so no dataset splits for validation are reported. |
| Hardware Specification | No | The paper is theoretical and describes no empirical experiments, so no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and describes no empirical experiments, so no software dependencies with version numbers are listed. |
| Experiment Setup | No | The paper is theoretical and describes no empirical experiments with hyperparameter tuning or training configurations, so no experimental setup details are provided. |
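The paper's algorithms (Conditional Posterior Sampling with its Main and Booster subroutines) are given only as pseudocode. As a hypothetical illustration of the posterior-sampling principle the paper builds on — not the paper's Algorithm 1 — the following sketch applies Thompson-style sampling to a two-player zero-sum matrix game: each episode, a payoff matrix is sampled from the current posterior, both players respond to the sample via a maximin rule, and the observed noisy payoff updates the posterior. All function and variable names here are illustrative assumptions.

```python
# Hypothetical sketch of posterior sampling (Thompson-style) for a
# two-player zero-sum matrix game. NOT the paper's Algorithm 1; the
# Gaussian per-entry posterior and maximin response rule are assumptions
# chosen to keep the example minimal and self-contained.
import numpy as np

def posterior_sampling_matrix_game(true_payoff, T=500, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = true_payoff.shape
    # Per-entry Gaussian posterior, tracked via running mean and count.
    means = np.zeros((n_rows, n_cols))
    counts = np.zeros((n_rows, n_cols))
    payoffs = []
    for _ in range(T):
        # Sample one payoff matrix from the current posterior; unvisited
        # entries keep the full prior width noise_std.
        std = noise_std / np.sqrt(np.maximum(counts, 1.0))
        sampled = rng.normal(means, std)
        # Max player picks the maximin row of the sample; min player
        # best-responds with the minimizing column.
        i = int(np.argmax(sampled.min(axis=1)))
        j = int(np.argmin(sampled[i]))
        # Observe a noisy payoff and update the posterior mean online.
        r = true_payoff[i, j] + rng.normal(0.0, noise_std)
        counts[i, j] += 1
        means[i, j] += (r - means[i, j]) / counts[i, j]
        payoffs.append(r)
    return means, np.array(payoffs)
```

The sketch mirrors the self-play structure described in the paper at a high level: both players act on the same posterior sample each episode, and the data generated by self-play drives the posterior update.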