Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Near-Optimal Reinforcement Learning with Self-Play
Authors: Yu Bai, Chi Jin, Tiancheng Yu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions, and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as Ω(S(A + B)), leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity Õ(SAB), and a new Nash V-learning algorithm with sample complexity Õ(S(A + B)). The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium. |
| Researcher Affiliation | Collaboration | Yu Bai (Salesforce Research); Chi Jin (Princeton University); Tiancheng Yu (MIT) |
| Pseudocode | Yes | Algorithm 1: Optimistic Nash Q-learning; Algorithm 2: Certified Policy µ̂ of Nash Q-learning; Algorithm 3: Optimistic Nash V-learning (the max-player version); Algorithm 4: Certified Policy µ̂ of Nash V-learning |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and focuses on sample complexity in a tabular episodic Markov game model. It does not describe experiments using a specific public dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments with specific datasets, and therefore no training/validation/test splits are mentioned. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithms and proofs; it does not specify any software dependencies with version numbers required for reproducibility. |
| Experiment Setup | No | The paper is theoretical and describes algorithms and their theoretical guarantees. It does not provide concrete experimental setup details such as hyperparameters or training configurations for practical implementation. |
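Since the paper releases no code, the flavor of its Nash V-learning result can still be conveyed with a toy sketch. Per the abstract, each player independently runs a no-regret learning update over its own actions; a standard instance of such an update is exponential weights (Hedge). The sketch below is an illustration only, not the paper's algorithm (which adds optimistic bonuses and operates on episodic Markov games): both players run Hedge against each other on a hand-picked 2×2 zero-sum matrix game whose Nash value is 0.2, and the time-averaged payoff is driven toward that value by the no-regret guarantee. The matrix, horizon T, and step size eta are arbitrary choices for the demo.

```python
import math

# Payoff to the max-player; the min-player receives the negation.
# This particular game's Nash value is 0.2 (max-player mixes 2/5, 3/5).
G = [[2.0, -1.0],
     [-1.0, 1.0]]

def hedge(weights, losses, eta):
    """One exponential-weights step: w_i <- w_i * exp(-eta * loss_i), renormalized."""
    w = [wi * math.exp(-eta * li) for wi, li in zip(weights, losses)]
    s = sum(w)
    return [wi / s for wi in w]

def self_play(T=20000, eta=0.02):
    p = [0.5, 0.5]  # max-player's mixed strategy
    q = [0.5, 0.5]  # min-player's mixed strategy
    total = 0.0
    for _ in range(T):
        # Expected payoff of this round under the current mixtures.
        total += sum(p[a] * G[a][b] * q[b] for a in range(2) for b in range(2))
        # Each player's per-action expected loss against the opponent's mixture:
        # the max-player minimizes negative payoff, the min-player minimizes payoff.
        loss_p = [-sum(G[a][b] * q[b] for b in range(2)) for a in range(2)]
        loss_q = [sum(p[a] * G[a][b] for a in range(2)) for b in range(2)]
        p = hedge(p, loss_p, eta)
        q = hedge(q, loss_q, eta)
    return total / T

avg_payoff = self_play()  # approaches the game value 0.2 as T grows
```

Because both players have vanishing average regret, the averaged play forms an approximate Nash equilibrium of the matrix game; the paper's Nash V-learning extends this decoupled no-regret structure to the full episodic Markov game setting, which is what yields the Õ(S(A + B)) sample complexity quoted above.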