Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints
Authors: Dan Qiao, Yu-Xiang Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 A B K})$, while the batch complexity is only $O(H + \log\log K)$. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_A K} + \log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity. We design a new policy elimination based algorithm (Algorithm 1) that achieves $O(H + \log\log K)$ batch complexity and an $\widetilde{O}(\sqrt{H^2 S^2 A B T})$ regret bound (Theorem 4.1). To our knowledge, this provides the first result for multi-agent RL with low adaptivity. Moreover, the regret bound has optimal dependence on $T$ while the batch complexity is optimal up to logarithmic factors among all algorithms with $\widetilde{O}(\sqrt{T})$ regret bound (due to our Theorem 4.2). We also propose a new low-adaptive algorithm (Algorithm 4) for reward-free exploration in Markov games. It comes with an optimal batch complexity of $O(H)$ and provably identifies an $\epsilon$-approximate Nash policy simultaneously for all possible reward functions (Theorem 5.2). The result improves over previous results in both sample complexity and batch complexity. (Section 6: Proof overview.) See the note after the table for how the $K$-based and $T$-based regret bounds relate. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of California, Santa Barbara (UCSB), USA; (2) Halıcıoğlu Data Science Institute, University of California, San Diego (UCSD), USA. |
| Pseudocode | Yes | Algorithm 1: Main algorithm; Algorithm 2: Crude Exploration; Algorithm 3: Fine Exploration; Algorithm 4: Algorithm for reward-free case; Algorithm 5: Compute Transition Kernel (Estimate Transition); Algorithm 6: Algorithm for bandit games |
| Open Source Code | No | The paper does not provide any statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not use or refer to any publicly available or open datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments, thus it does not provide details on training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe empirical experiments, therefore no specific hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or implementations requiring specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and theoretical analysis rather than empirical experiments. Consequently, it does not provide details on experimental setup, hyperparameters, or training configurations. |
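
Note on the quoted regret bounds: in the excerpts above, $S$ is the number of states, $A$ and $B$ are the two players' action-set sizes, $H$ is the horizon, and $K$ is the number of episodes. As a quick consistency check (a sketch assuming the standard convention $T = HK$ for the total number of steps, which the quoted excerpts do not restate), the $K$-based and $T$-based regret bounds are the same quantity:

```latex
% Relating the two quoted regret bounds, assuming T = HK (total steps = horizon x episodes).
\widetilde{O}\!\left(\sqrt{H^{3}S^{2}ABK}\right)
  = \widetilde{O}\!\left(\sqrt{H^{2}S^{2}AB \cdot HK}\right)
  = \widetilde{O}\!\left(\sqrt{H^{2}S^{2}ABT}\right)
```

This is only an arithmetic restatement of the bounds already quoted from the paper, not an additional result.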