Online Learning in Unknown Markov Games

Authors: Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We present an algorithm that achieves sublinear $O(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analyses (e.g., Xie et al., 2020) by a factor exponential in the number of opponents. Theorem 2 (Regret bounds). For any $p \in (0, 1)$, let $\iota = \log(HSAK/p)$. If we run V-OL with the hyperparameter specification (3) for some large constant $c > 0$ and $G \ge 1$ in an online two-player zero-sum MG, then with probability at least $1 - p$, the regret in $K$ episodes satisfies $\mathrm{Regret}(K) = O\big(GH^3 S \iota^2 + \sqrt{GH^5 SAK\iota} + G^{-1}KH\big)$.
Researcher Affiliation | Academia | Department of EECS, MIT; Department of Computer Science, Princeton University.
Pseudocode | Yes | Algorithm 1: Optimistic Nash V-learning for Online Learning (V-OL)
Open Source Code | No | The paper does not contain any statement about making its source code available or provide a link to a code repository.
Open Datasets | No | The paper is theoretical and does not describe the use of any specific datasets for training, nor does it provide information about their public availability.
Dataset Splits | No | As this is a theoretical paper, it does not describe experimental validation or dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and describes an algorithm's parameters as part of its theoretical formulation (e.g., hyperparameter specification (3)), but does not provide specific experimental setup details such as concrete hyperparameter values or training configurations for an empirical study.
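The $O(K^{2/3})$ rate quoted in the Research Type row follows from tuning the hyperparameter $G$. A minimal sketch, assuming the regret bound takes the form $O\big(GH^3 S \iota^2 + \sqrt{GH^5 SAK\iota} + G^{-1}KH\big)$ (the exact choice of $G$ below is illustrative, not a quotation from the paper):

```latex
% Balance the two K-dependent terms over the choice of G >= 1:
%   sqrt(G H^5 S A K \iota) = G^{-1} K H
% Squaring gives G H^5 S A K \iota = G^{-2} K^2 H^2, i.e.
%   G^3 = K H^2 / (H^5 S A \iota).
\[
  G \;=\; \left(\frac{K H^2}{H^5 S A \iota}\right)^{1/3}.
\]
% Substituting back into G^{-1} K H:
\[
  G^{-1} K H
  \;=\; K H \cdot \left(\frac{H^5 S A \iota}{K H^2}\right)^{1/3}
  \;=\; H^{2} (S A \iota)^{1/3}\, K^{2/3},
\]
% so both K-dependent terms scale as K^{2/3}, which is sublinear in K.
```

The remaining term $GH^3 S\iota^2$ grows only as $K^{1/3}$ under this choice, so it does not affect the leading rate.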