Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Authors: Dan Qiao, Yu-Xiang Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 A B K})$, while the batch complexity is only $O(H + \log\log K)$. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_A K} + \log\log K)$ for all algorithms with an $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity. We design a new policy elimination based algorithm (Algorithm 1) that achieves $O(H + \log\log K)$ batch complexity and an $\widetilde{O}(\sqrt{H^2 S^2 A B T})$ regret bound (Theorem 4.1). To our knowledge, this provides the first result for multi-agent RL with low adaptivity. Moreover, the regret bound has optimal dependence on $T$, while the batch complexity is optimal up to logarithmic factors among all algorithms with an $\widetilde{O}(\sqrt{T})$ regret bound (due to our Theorem 4.2). We also propose a new low-adaptive algorithm (Algorithm 4) for reward-free exploration in Markov games. It comes with an optimal batch complexity of $O(H)$ and provably identifies an $\epsilon$-approximate Nash policy simultaneously for all possible reward functions (Theorem 5.2). The result improves over previous results in both sample complexity and batch complexity. Section 6: Proof overview. (An illustrative sketch of a doubly-logarithmic batch schedule appears after this table.)
Researcher Affiliation | Academia | (1) Department of Computer Science, University of California, Santa Barbara (UCSB), USA; (2) Halıcıoğlu Data Science Institute, University of California, San Diego (UCSD), USA.
Pseudocode | Yes | Algorithm 1 (Main algorithm); Algorithm 2 (Crude Exploration); Algorithm 3 (Fine Exploration); Algorithm 4 (Algorithm for reward-free case); Algorithm 5 (Compute Transition Kernel / Estimate Transition); Algorithm 6 (Algorithm for bandit games).
Open Source Code | No | The paper does not provide any statements or links indicating the release of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not use or refer to any publicly available or open datasets for training or evaluation.
Dataset Splits | No | The paper is theoretical and does not describe experiments, and thus does not provide details on training, validation, or test dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe empirical experiments, so no specific hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or implementations requiring specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and theoretical analysis rather than empirical experiments. Consequently, it does not provide details on experimental setup, hyperparameters, or training configurations.
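For intuition about the $O(H + \log\log K)$ batch complexity quoted in the Research Type row, the sketch below computes the standard "squaring" batch schedule (each batch ends near the geometric mean of the previous endpoint and $K$) that underlies doubly-logarithmic batch counts in batched bandits and low-adaptivity RL. This is an illustrative assumption rather than a reproduction of the paper's Algorithm 1: the helper name batch_schedule is hypothetical, and the additional $O(H)$ term in the paper comes from its exploration phases, which are not modeled here.

```python
import math

def batch_schedule(K: int) -> list[int]:
    """Episode indices after which the learner may recompute its policy.

    Endpoint of batch i is roughly K^(1 - 2^{-i}), i.e. the geometric mean
    of the previous endpoint and K, so ~log2(log2(K)) batches reach K.
    """
    num_batches = math.ceil(math.log2(math.log2(K))) + 1
    endpoints = [min(math.ceil(K ** (1.0 - 2.0 ** (-i))), K)
                 for i in range(1, num_batches)]
    endpoints.append(K)  # the final batch simply runs to episode K
    return endpoints

for K in (10**4, 10**6, 10**8):
    sched = batch_schedule(K)
    print(f"K={K:>9}: {len(sched)} batches, endpoints {sched}")
```

Running this prints 5, 6, and 6 batches for K = 10^4, 10^6, and 10^8 respectively, illustrating that the number of policy updates grows like $\log\log K$ rather than with $K$ itself.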