Provable Self-Play Algorithms for Competitive Reinforcement Learning

Authors: Yu Bai, Chi Jin

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret Õ(√T) after playing T steps of the game... To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning."
Researcher Affiliation | Collaboration | 1Salesforce Research, 2Princeton University. Correspondence to: Yu Bai <yu.bai@salesforce.com>, Chi Jin <chij@princeton.edu>.
Pseudocode | Yes | Algorithm 1: Value Iteration with Upper-Lower Confidence Bound (VI-ULCB); Algorithm 2: Value Iteration after Exploration (VI-Explore).
Open Source Code | No | The paper provides no statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and operates within the framework of Markov games; it neither uses a specific, named dataset for training or experimentation nor provides access information for any dataset.
Dataset Splits | No | The paper is theoretical and conducts no experiments on datasets with training, validation, or test splits.
Hardware Specification | No | The paper focuses on algorithm design and theoretical guarantees; it does not describe any hardware used to run experiments.
Software Dependencies | No | The paper mentions subroutines such as 'NASH GENERAL SUM' and 'NASH ZERO SUM', referring to methods like the Lemke-Howson algorithm and convex-concave optimization, but it specifies no software packages with version numbers as required dependencies for reproducibility.
Experiment Setup | No | The paper describes theoretical algorithms and their properties; it includes no experimental-setup details such as hyperparameter values, model initialization, or training schedules.
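The Pseudocode row above names Algorithm 1, VI-ULCB, which maintains upper and lower confidence bounds on values. Its core ingredient can be illustrated in miniature with a Hoeffding-style bonus that brackets an empirical mean and shrinks as samples accumulate; the function name, constant c, and log factor below are illustrative choices, not taken from the paper:

```python
import math

def ulcb_interval(rewards, total_steps, c=1.0):
    """Upper/lower confidence interval around an empirical mean.

    A Hoeffding-style bonus c * sqrt(log(total_steps) / n) is added to
    and subtracted from the sample mean; in VI-ULCB-style algorithms the
    upper bound drives optimistic play and the lower bound pessimistic play.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    bonus = c * math.sqrt(math.log(max(total_steps, 2)) / n)
    return mean - bonus, mean + bonus

# The interval always brackets the empirical mean, and its width
# shrinks as 1/sqrt(n) when more samples of the same data arrive.
lo, hi = ulcb_interval([1, 0, 1, 1], total_steps=100)
lo2, hi2 = ulcb_interval([1, 0, 1, 1] * 25, total_steps=100)
```

With 4 samples the interval is wide; with 100 samples of the same empirical mean it is five times narrower, which is what lets the upper and lower value estimates pinch together as the algorithm explores.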
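The Software Dependencies row mentions a 'NASH ZERO SUM' subroutine, which the paper treats via convex-concave optimization. As a hedged, self-contained substitute (not the paper's implementation), fictitious play also approximates a Nash equilibrium of a zero-sum matrix game, since its empirical action frequencies converge for such games (Robinson, 1951):

```python
def fictitious_play(A, iters=10000):
    """Approximate a Nash equilibrium of the zero-sum matrix game A
    (row player maximizes A, column player minimizes it) by having each
    player best-respond to the opponent's empirical action frequencies.
    """
    m, n = len(A), len(A[0])
    row_counts, col_counts = [0] * m, [0] * n
    row_cum = [0.0] * m  # payoff of each row action vs. column's history
    col_cum = [0.0] * n  # payoff of each column action vs. row's history
    i = j = 0  # arbitrary initial actions
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        for a in range(m):
            row_cum[a] += A[a][j]
        for b in range(n):
            col_cum[b] += A[i][b]
        i = max(range(m), key=lambda a: row_cum[a])  # row best response
        j = min(range(n), key=lambda b: col_cum[b])  # column best response
    x = [c / iters for c in row_counts]
    y = [c / iters for c in col_counts]
    return x, y

# Matching pennies: the unique equilibrium mixes (1/2, 1/2) for both
# players, and the value of the game is 0.
x, y = fictitious_play([[1, -1], [-1, 1]])
```

Convergence is slow (roughly 1/√t in the frequencies), which is why the paper's subroutines lean on faster equilibrium computations such as Lemke-Howson for general-sum games; this sketch only shows what the subroutine is asked to produce.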