Near-Optimal Reinforcement Learning with Self-Play

Authors: Yu Bai, Chi Jin, Tiancheng Yu

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions, and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A + B)) and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity Õ(SAB), and a new Nash V-learning algorithm with sample complexity Õ(S(A + B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium.
Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research, yu.bai@salesforce.com); Chi Jin (Princeton University, chij@princeton.edu); Tiancheng Yu (MIT, yutc@mit.edu)
Pseudocode | Yes | Algorithm 1: Optimistic Nash Q-learning; Algorithm 2: Certified Policy µ̂ of Nash Q-learning; Algorithm 3: Optimistic Nash V-learning (the max-player version); Algorithm 4: Certified Policy µ̂ of Nash V-learning. (Illustrative sketches of the matrix-game step and of the optimistic Q-update appear after the table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and focuses on sample complexity in a tabular episodic Markov game model. It does not describe experiments using a specific public dataset for training.
Dataset Splits | No | The paper is theoretical and does not describe experiments with specific datasets, and therefore no training/validation/test splits are mentioned.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs; it does not specify any software dependencies with version numbers required for reproducibility.
Experiment Setup | No | The paper is theoretical and describes algorithms and their theoretical guarantees. It does not provide concrete experimental setup details such as hyperparameters or training configurations for practical implementation.
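The algorithms listed in the Pseudocode row repeatedly solve a small zero-sum matrix game (or a coarse correlated equilibrium) at each visited state. As a point of reference only, here is a minimal sketch of how the Nash value and the max-player strategy of a single zero-sum matrix game can be computed by linear programming. The function name, the toy payoff matrix, and the use of scipy are assumptions made for illustration; none of this is taken from the paper.

```python
# Minimal sketch (not the authors' code): compute the value and max-player
# Nash strategy of a zero-sum matrix game via linear programming.
import numpy as np
from scipy.optimize import linprog


def solve_zero_sum_game(M):
    """Return (value, max_player_strategy) for an A x B payoff matrix M.

    The max-player picks a row distribution x to maximize min_j (x^T M)_j.
    Variables: z = [x_1, ..., x_A, v]; maximize v subject to x^T M >= v * 1.
    """
    A, B = M.shape
    c = np.zeros(A + 1)
    c[-1] = -1.0                                 # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((B, 1))])    # v - (x^T M)_j <= 0 for every column j
    b_ub = np.zeros(B)
    A_eq = np.ones((1, A + 1))
    A_eq[0, -1] = 0.0                            # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:A]


if __name__ == "__main__":
    # Matching pennies (hypothetical example): value 0, uniform strategy.
    M = np.array([[1.0, -1.0], [-1.0, 1.0]])
    value, x = solve_zero_sum_game(M)
    print(f"game value ~ {value:.3f}, max-player strategy ~ {x}")
```

For the matching-pennies example the solver returns a value of roughly 0 and a near-uniform strategy, which is the expected Nash behavior for that game.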
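Below is a hedged sketch of the style of per-visit update in Algorithm 1 (Optimistic Nash Q-learning): an incremental Q-update with the stage-dependent learning rate α_t = (H + 1)/(H + t) and an optimism bonus that shrinks with the visit count. The environment interface, array shapes, and the bonus constants are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch only (not the authors' implementation) of the optimistic
# incremental update used in tabular Nash Q-learning. `c_bonus` and `iota`
# stand in for the paper's logarithmic confidence terms and are assumptions.
import numpy as np


def optimistic_nash_q_update(Q_up, V_up_next, counts, s, a, b, r, s_next, H,
                             c_bonus=1.0, iota=1.0):
    """One update of the upper (optimistic) Q estimate after visiting (s, a, b).

    Q_up:      array of shape (S, A, B), upper Q estimates at the current step h
    V_up_next: array of shape (S,), upper V estimates at step h + 1
    counts:    array of shape (S, A, B), visit counts for (s, a, b) at step h
    """
    counts[s, a, b] += 1
    t = counts[s, a, b]
    alpha = (H + 1) / (H + t)                    # stage-wise learning rate
    beta = c_bonus * np.sqrt(H ** 3 * iota / t)  # optimism bonus, shrinks with visits
    target = r + V_up_next[s_next] + beta
    Q_up[s, a, b] = (1 - alpha) * Q_up[s, a, b] + alpha * target
    return Q_up[s, a, b]
```

In the paper a symmetric pessimistic (lower) update is maintained as well, and the policy at s is then recomputed from the upper and lower Q matrices via an equilibrium subroutine; the matrix-game solver sketched above could play an analogous role in a single-matrix simplification.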