Posterior Sampling for Competitive RL: Function Approximation and Partial Observation
Authors: Shuang Qiu, Ziyu Dai, Han Zhong, Zhaoran Wang, Zhuoran Yang, Tong Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper investigates posterior sampling algorithms for competitive reinforcement learning (RL) with general function approximation. Focusing on zero-sum Markov games (MGs) under two critical settings, namely self-play and adversarial learning, the authors first propose the self-play and adversarial generalized eluder coefficient (GEC) as complexity measures for function approximation, capturing the exploration-exploitation trade-off in MGs. Based on the self-play GEC, they propose a model-based self-play posterior sampling method that controls both players to learn a Nash equilibrium and can handle partial observability of states. They further identify a class of partially observable MG models suited to learning against adversarial opponent policies and, incorporating the adversarial GEC, propose a model-based posterior sampling method for learning adversarial MGs with potential partial observability. They provide regret bounds for the proposed algorithms that scale sublinearly in the number of episodes T, with dependence on the proposed GEC. |
| Researcher Affiliation | Academia | Shuang Qiu (HKUST, masqiu@ust.hk); Ziyu Dai (New York University, zd2365@cims.nyu.edu); Han Zhong (Peking University, hanzhong@stu.pku.edu.cn); Zhaoran Wang (Northwestern University, zhaoranwang@gmail.com); Zhuoran Yang (Yale University, zhuoran.yang@yale.edu); Tong Zhang (HKUST, tongzhang@ust.hk) |
| Pseudocode | Yes | Algorithm 1: Model-Based Posterior Sampling for Self-Play (Max-Player); Algorithm 2: Model-Based Posterior Sampling with Adversarial Opponent; Algorithm 3: Model-Based Posterior Sampling for Self-Play (Min-Player) |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees (regret bounds) for competitive RL. It does not describe experiments using specific public datasets. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits for training, validation, and testing. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers for implementation or experiments. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup including hyperparameters or specific training settings, as it does not report on empirical experiments. |
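To illustrate the model-based posterior sampling idea that the paper's algorithms build on, here is a toy sketch for a zero-sum matrix game with Bernoulli payoffs: each episode, a plausible game is drawn from a Beta posterior over the payoff means, and both players act greedily in the sampled game. This is a hypothetical simplification for intuition only: it restricts both players to pure maximin/minimax strategies and drops function approximation, partial observability, and the GEC-based analysis that the paper actually studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy zero-sum matrix game: the max-player picks a row, the min-player a
# column, and the max-player receives a Bernoulli payoff whose unknown
# mean is true_means[row, col]. (Illustrative values, not from the paper.)
true_means = np.array([[0.7, 0.2],
                       [0.4, 0.6]])
n_rows, n_cols = true_means.shape

# Beta(1, 1) posterior over each entry's mean (conjugate to Bernoulli).
wins = np.ones((n_rows, n_cols))
losses = np.ones((n_rows, n_cols))

T = 2000
total_payoff = 0.0
for t in range(T):
    # Posterior sampling: draw one plausible game model from the posterior ...
    sampled = rng.beta(wins, losses)
    # ... and play greedily in it (pure maximin / minimax for simplicity;
    # the paper's algorithms compute mixed Nash equilibria instead).
    row = int(np.argmax(sampled.min(axis=1)))  # max-player hedges against worst column
    col = int(np.argmin(sampled.max(axis=0)))  # min-player hedges against worst row
    # Observe a Bernoulli payoff and update the played entry's posterior.
    r = float(rng.random() < true_means[row, col])
    wins[row, col] += r
    losses[row, col] += 1.0 - r
    total_payoff += r

print(f"average payoff over {T} episodes: {total_payoff / T:.2f}")
```

As the posterior concentrates, the sampled games approach the true game, so the played strategies approach the (pure-strategy) maximin/minimax play of the true game; the paper's regret bounds formalize this convergence rate in terms of the GEC and T for the far more general Markov-game setting.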