Near-Optimal Reinforcement Learning with Self-Play

Authors: Yu Bai, Chi Jin, Tiancheng Yu

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions, and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A + B)) and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity Õ(SAB), and a new Nash V-learning algorithm with sample complexity Õ(S(A + B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium.
Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research, yu.bai@salesforce.com); Chi Jin (Princeton University, chij@princeton.edu); Tiancheng Yu (MIT, yutc@mit.edu)
Pseudocode | Yes | Algorithm 1: Optimistic Nash Q-learning; Algorithm 2: Certified Policy µ̂ of Nash Q-learning; Algorithm 3: Optimistic Nash V-learning (the max-player version); Algorithm 4: Certified Policy µ̂ of Nash V-learning. (Illustrative sketches of the matrix-game step and of the optimistic Q-update appear after the table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and focuses on sample complexity in a tabular episodic Markov game model. It does not describe experiments using a specific public dataset for training.
Dataset Splits | No | The paper is theoretical and does not describe experiments with specific datasets, and therefore no training/validation/test splits are mentioned.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs; it does not specify any software dependencies with version numbers required for reproducibility.
Experiment Setup | No | The paper is theoretical and describes algorithms and their theoretical guarantees. It does not provide concrete experimental setup details such as hyperparameters or training configurations for practical implementation.
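The algorithms listed in the Pseudocode row repeatedly solve a small zero-sum matrix game (or a coarse correlated equilibrium) at each visited state. As a point of reference only, here is a minimal sketch of how the Nash value and the max-player strategy of a single zero-sum matrix game can be computed by linear programming. The function name, the toy payoff matrix, and the use of scipy are assumptions made for illustration; none of this is taken from the paper.

```python
# Minimal sketch (not the authors' code): compute the value and max-player
# Nash strategy of a zero-sum matrix game via linear programming.
import numpy as np
from scipy.optimize import linprog


def solve_zero_sum_game(M):
    """Return (value, max_player_strategy) for an A x B payoff matrix M.

    The max-player picks a row distribution x to maximize min_j (x^T M)_j.
    Variables: z = [x_1, ..., x_A, v]; maximize v subject to x^T M >= v * 1.
    """
    A, B = M.shape
    c = np.zeros(A + 1)
    c[-1] = -1.0                                 # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((B, 1))])    # v - (x^T M)_j <= 0 for every column j
    b_ub = np.zeros(B)
    A_eq = np.ones((1, A + 1))
    A_eq[0, -1] = 0.0                            # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:A]


if __name__ == "__main__":
    # Matching pennies (hypothetical example): value 0, uniform strategy.
    M = np.array([[1.0, -1.0], [-1.0, 1.0]])
    value, x = solve_zero_sum_game(M)
    print(f"game value ~ {value:.3f}, max-player strategy ~ {x}")
```

For the matching-pennies example the solver returns a value of roughly 0 and a near-uniform strategy, which is the expected Nash behavior for that game.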
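Below is a hedged sketch of the style of per-visit update in Algorithm 1 (Optimistic Nash Q-learning): an incremental Q-update with the stage-dependent learning rate α_t = (H + 1)/(H + t) and an optimism bonus that shrinks with the visit count. The environment interface, array shapes, and the bonus constants are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch only (not the authors' implementation) of the optimistic
# incremental update used in tabular Nash Q-learning. `c_bonus` and `iota`
# stand in for the paper's logarithmic confidence terms and are assumptions.
import numpy as np


def optimistic_nash_q_update(Q_up, V_up_next, counts, s, a, b, r, s_next, H,
                             c_bonus=1.0, iota=1.0):
    """One update of the upper (optimistic) Q estimate after visiting (s, a, b).

    Q_up:      array of shape (S, A, B), upper Q estimates at the current step h
    V_up_next: array of shape (S,), upper V estimates at step h + 1
    counts:    array of shape (S, A, B), visit counts for (s, a, b) at step h
    """
    counts[s, a, b] += 1
    t = counts[s, a, b]
    alpha = (H + 1) / (H + t)                    # stage-wise learning rate
    beta = c_bonus * np.sqrt(H ** 3 * iota / t)  # optimism bonus, shrinks with visits
    target = r + V_up_next[s_next] + beta
    Q_up[s, a, b] = (1 - alpha) * Q_up[s, a, b] + alpha * target
    return Q_up[s, a, b]
```

In the paper a symmetric pessimistic (lower) update is maintained as well, and the policy at s is then recomputed from the upper and lower Q matrices via an equilibrium subroutine; the matrix-game solver sketched above could play an analogous role in a single-matrix simplification.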