Provable Self-Play Algorithms for Competitive Reinforcement Learning
Authors: Yu Bai, Chi Jin
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret Õ(√T) after playing T steps of the game... To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning. |
| Researcher Affiliation | Collaboration | 1Salesforce Research 2Princeton University. Correspondence to: Yu Bai <yu.bai@salesforce.com>, Chi Jin <chij@princeton.edu>. |
| Pseudocode | Yes | Algorithm 1 Value Iteration with Upper-Lower Confidence Bound (VI-ULCB); Algorithm 2 Value Iteration after Exploration (VI-Explore). An illustrative sketch of the VI-ULCB backup step appears after the table. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and operates within the framework of Markov games; it does not use a specific, named dataset for training or experimentation, nor does it provide access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments on datasets with specified training, validation, or test splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper mentions subroutines such as 'NASH GENERAL SUM' and 'NASH ZERO SUM', which refer to methods like the Lemke-Howson algorithm and convex-concave optimization, but it does not name any software dependencies with version numbers needed for reproducibility. (A sketch of a zero-sum solver appears after the table.) |
| Experiment Setup | No | The paper describes theoretical algorithms and their properties. It does not include details on experimental setup such as specific hyperparameter values, model initialization, or training schedules. |
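For context on the subroutines noted above: in the two-player zero-sum case, the per-state matrix game can be solved with a standard linear program (the Lemke-Howson algorithm is only needed for the general-sum case). The sketch below is illustrative, not code from the paper; the function name `solve_zero_sum` and the use of `scipy.optimize.linprog` are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Solve the matrix game max_x min_y x^T A y by linear programming.

    A is the payoff matrix for the row (max) player. Returns the game
    value and mixed strategies (x, y). Illustrative stand-in for a
    zero-sum Nash subroutine; not code from the paper.
    """
    m, n = A.shape
    # Shift payoffs to be strictly positive; this changes the value by a
    # constant but leaves the equilibrium strategies unchanged.
    shift = A.min() - 1.0
    B = A - shift
    # Row player: min sum(u) s.t. B^T u >= 1, u >= 0; then x = u / sum(u)
    # and the (shifted) game value is 1 / sum(u).
    res_x = linprog(c=np.ones(m), A_ub=-B.T, b_ub=-np.ones(n))
    x = res_x.x / res_x.x.sum()
    # Column player: max sum(w) s.t. B w <= 1, w >= 0; then y = w / sum(w).
    res_y = linprog(c=-np.ones(n), A_ub=B, b_ub=np.ones(m))
    y = res_y.x / res_y.x.sum()
    value = 1.0 / res_x.x.sum() + shift
    return value, x, y
```

As a quick check, `solve_zero_sum(np.array([[0., 1.], [1., 0.]]))` returns value 0.5 with both players playing uniformly at random, as expected for this symmetric game.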
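Likewise, here is a minimal sketch of the optimistic/pessimistic backup in the spirit of Algorithm 1 (VI-ULCB), reusing `solve_zero_sum` from the sketch above. One caveat: the paper couples the two players by computing a general-sum Nash equilibrium of the upper/lower Q-estimate pair (the 'NASH GENERAL SUM' subroutine, hence the Lemke-Howson reference); for runnability we substitute two independent zero-sum solves, which captures the confidence-bound structure but not the paper's exact coupling. All names, array shapes, and the bonus constant are hypothetical.

```python
import numpy as np

def vi_ulcb_backup(counts, P_hat, r_hat, H, c_bonus=1.0):
    """One backward value-iteration pass with upper/lower confidence
    bounds, in the spirit of VI-ULCB (Algorithm 1). Illustrative only.

    Hypothetical shapes: counts[h, s, a, b] visit counts,
    P_hat[h, s, a, b, s'] empirical transitions, r_hat[h, s, a, b]
    empirical mean rewards, with horizon H and rewards in [0, 1].
    """
    _, S, A, B = counts.shape
    Q_up = np.zeros((H, S, A, B))
    Q_lo = np.zeros((H, S, A, B))
    V_up = np.zeros((H + 1, S))
    V_lo = np.zeros((H + 1, S))
    policy = {}  # (h, s) -> (max-player strategy, min-player strategy)
    for h in reversed(range(H)):
        for s in range(S):
            n = np.maximum(counts[h, s], 1)        # avoid divide-by-zero
            bonus = c_bonus * np.sqrt(1.0 / n)     # UCB-style exploration bonus
            Q_up[h, s] = np.clip(r_hat[h, s] + P_hat[h, s] @ V_up[h + 1] + bonus, 0.0, H)
            Q_lo[h, s] = np.clip(r_hat[h, s] + P_hat[h, s] @ V_lo[h + 1] - bonus, 0.0, H)
            # Paper: joint policy = general-sum Nash of (Q_up, Q_lo); here
            # simplified to two independent zero-sum solves.
            V_up[h, s], x, _ = solve_zero_sum(Q_up[h, s])
            V_lo[h, s], _, y = solve_zero_sum(Q_lo[h, s])
            policy[(h, s)] = (x, y)
    return Q_up, Q_lo, policy
```

The gap between V_up and V_lo at the initial state bounds the suboptimality of the extracted policy pair, which is the mechanism behind the paper's Õ(√T) regret guarantee.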