Near-Optimal Reinforcement Learning with Self-Play
Authors: Yu Bai, Chi Jin, Tiancheng Yu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires O(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A + B)) and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity O(SAB), and a new Nash V-learning algorithm with sample complexity O(S(A + B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium. |
| Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research, yu.bai@salesforce.com); Chi Jin (Princeton University, chij@princeton.edu); Tiancheng Yu (MIT, yutc@mit.edu) |
| Pseudocode | Yes | Algorithm 1 Optimistic Nash Q-learning; Algorithm 2 Certified Policy µ̂ of Nash Q-learning; Algorithm 3 Optimistic Nash V-learning (the max-player version); Algorithm 4 Certified Policy µ̂ of Nash V-learning (a hedged sketch of related update steps appears after this table) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and focuses on sample complexity in a tabular episodic Markov game model. It does not describe experiments using a specific public dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments with specific datasets, and therefore no training/validation/test splits are mentioned. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs; it does not specify any software dependencies with version numbers required for reproducibility. |
| Experiment Setup | No | The paper is theoretical and describes algorithms and their theoretical guarantees. It does not provide concrete experimental setup details such as hyperparameters or training configurations for practical implementation. |
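
The pseudocode row above refers to the paper's optimistic Nash Q-learning and Nash V-learning algorithms. As a point of reference only, the sketch below shows two ingredients that tabular algorithms of this kind rely on: an exact linear-programming solver for a zero-sum matrix game (the paper's Algorithm 1 computes a coarse correlated equilibrium of an optimistic/pessimistic Q pair; solving a single zero-sum matrix game is the closely related textbook step), and an optimistic Q-value update with the (H+1)/(H+t) learning-rate schedule common to this line of work. Function names, the array layout, and the bonus constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def solve_zero_sum_game(G):
    """Value and max-player strategy of the matrix game max_x min_y x^T G y.

    G has shape (A, B): rows index max-player actions, columns min-player
    actions. Formulated as an LP: maximize v subject to x^T G[:, b] >= v for
    every min-player action b, with x a probability distribution.
    """
    A, B = G.shape
    # Decision variables z = [x_1, ..., x_A, v]; linprog minimizes, so use -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # Inequalities: v - x^T G[:, b] <= 0 for every column b.
    A_ub = np.hstack([-G.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # Equality: the entries of x sum to one (v is excluded).
    A_eq = np.zeros((1, A + 1))
    A_eq[0, :A] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A], res.x[-1]


def optimistic_q_update(Q_bar, N, h, s, a, b, r, v_bar_next, H,
                        c_bonus=1.0, iota=1.0):
    """One optimistic Q-learning-style update at step h on the triple (s, a, b).

    Uses the alpha_t = (H + 1) / (H + t) learning rate and a
    sqrt(H^3 * iota / t) exploration bonus; the exact constants and clipping
    used in the paper may differ from this sketch.
    """
    N[h, s, a, b] += 1
    t = N[h, s, a, b]
    alpha = (H + 1) / (H + t)
    beta = c_bonus * np.sqrt(H ** 3 * iota / t)
    target = min(r + v_bar_next + beta, H)  # keep the optimistic target bounded by H
    Q_bar[h, s, a, b] = (1 - alpha) * Q_bar[h, s, a, b] + alpha * target
    return Q_bar[h, s, a, b]


if __name__ == "__main__":
    # Rock-paper-scissors payoff for the max player: the LP recovers the
    # uniform strategy with game value 0.
    G = np.array([[0.0, -1.0, 1.0],
                  [1.0, 0.0, -1.0],
                  [-1.0, 1.0, 0.0]])
    x, v = solve_zero_sum_game(G)
    print(np.round(x, 3), round(v, 3))
```

Running the snippet prints the uniform rock-paper-scissors strategy and a game value of zero, which is the expected equilibrium for that payoff matrix.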