Provable Self-Play Algorithms for Competitive Reinforcement Learning
Authors: Yu Bai, Chi Jin
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret Õ(√T) after playing T steps of the game... To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning. |
| Researcher Affiliation | Collaboration | 1Salesforce Research 2Princeton University. Correspondence to: Yu Bai <yu.bai@salesforce.com>, Chi Jin <chij@princeton.edu>. |
| Pseudocode | Yes | Algorithm 1 Value Iteration with Upper-Lower Confidence Bound (VI-ULCB); Algorithm 2 Value Iteration after Exploration (VI-Explore). An illustrative sketch of the VI-ULCB backup step appears after the table. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and operates within the framework of Markov games; it does not use a specific, named dataset for training or experimentation, nor does it provide access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments on datasets with specified training, validation, or test splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper mentions subroutines such as 'NASH GENERAL SUM' and 'NASH ZERO SUM', which refer to methods like the Lemke-Howson algorithm and convex-concave optimization, but it does not name any software dependencies with version numbers needed for reproducibility. (A sketch of a zero-sum solver appears after the table.) |
| Experiment Setup | No | The paper describes theoretical algorithms and their properties. It does not include details on experimental setup such as specific hyperparameter values, model initialization, or training schedules. |
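For context on the subroutines noted above: in the two-player zero-sum case, the per-state matrix game can be solved with a standard linear program (the Lemke-Howson algorithm is only needed for the general-sum case). The sketch below is illustrative, not code from the paper; the function name `solve_zero_sum` and the use of `scipy.optimize.linprog` are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Solve the matrix game max_x min_y x^T A y by linear programming.

    A is the payoff matrix for the row (max) player. Returns the game
    value and mixed strategies (x, y). Illustrative stand-in for a
    zero-sum Nash subroutine; not code from the paper.
    """
    m, n = A.shape
    # Shift payoffs to be strictly positive; this changes the value by a
    # constant but leaves the equilibrium strategies unchanged.
    shift = A.min() - 1.0
    B = A - shift
    # Row player: min sum(u) s.t. B^T u >= 1, u >= 0; then x = u / sum(u)
    # and the (shifted) game value is 1 / sum(u).
    res_x = linprog(c=np.ones(m), A_ub=-B.T, b_ub=-np.ones(n))
    x = res_x.x / res_x.x.sum()
    # Column player: max sum(w) s.t. B w <= 1, w >= 0; then y = w / sum(w).
    res_y = linprog(c=-np.ones(n), A_ub=B, b_ub=np.ones(m))
    y = res_y.x / res_y.x.sum()
    value = 1.0 / res_x.x.sum() + shift
    return value, x, y
```

As a quick check, `solve_zero_sum(np.array([[0., 1.], [1., 0.]]))` returns value 0.5 with both players playing uniformly at random, as expected for this symmetric game.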
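Likewise, here is a minimal sketch of the optimistic/pessimistic backup in the spirit of Algorithm 1 (VI-ULCB), reusing `solve_zero_sum` from the sketch above. One caveat: the paper couples the two players by computing a general-sum Nash equilibrium of the upper/lower Q-estimate pair (the 'NASH GENERAL SUM' subroutine, hence the Lemke-Howson reference); for runnability we substitute two independent zero-sum solves, which captures the confidence-bound structure but not the paper's exact coupling. All names, array shapes, and the bonus constant are hypothetical.

```python
import numpy as np

def vi_ulcb_backup(counts, P_hat, r_hat, H, c_bonus=1.0):
    """One backward value-iteration pass with upper/lower confidence
    bounds, in the spirit of VI-ULCB (Algorithm 1). Illustrative only.

    Hypothetical shapes: counts[h, s, a, b] visit counts,
    P_hat[h, s, a, b, s'] empirical transitions, r_hat[h, s, a, b]
    empirical mean rewards, with horizon H and rewards in [0, 1].
    """
    _, S, A, B = counts.shape
    Q_up = np.zeros((H, S, A, B))
    Q_lo = np.zeros((H, S, A, B))
    V_up = np.zeros((H + 1, S))
    V_lo = np.zeros((H + 1, S))
    policy = {}  # (h, s) -> (max-player strategy, min-player strategy)
    for h in reversed(range(H)):
        for s in range(S):
            n = np.maximum(counts[h, s], 1)        # avoid divide-by-zero
            bonus = c_bonus * np.sqrt(1.0 / n)     # UCB-style exploration bonus
            Q_up[h, s] = np.clip(r_hat[h, s] + P_hat[h, s] @ V_up[h + 1] + bonus, 0.0, H)
            Q_lo[h, s] = np.clip(r_hat[h, s] + P_hat[h, s] @ V_lo[h + 1] - bonus, 0.0, H)
            # Paper: joint policy = general-sum Nash of (Q_up, Q_lo); here
            # simplified to two independent zero-sum solves.
            V_up[h, s], x, _ = solve_zero_sum(Q_up[h, s])
            V_lo[h, s], _, y = solve_zero_sum(Q_lo[h, s])
            policy[(h, s)] = (x, y)
    return Q_up, Q_lo, policy
```

The gap between V_up and V_lo at the initial state bounds the suboptimality of the extracted policy pair, which is the mechanism behind the paper's Õ(√T) regret guarantee.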