Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games
Authors: Stephen Marcus McAleer, JB Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, Roy Fox
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game's deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar's Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games. |
| Researcher Affiliation | Academia | 1 Department of Computer Science, Carnegie Mellon University; 2 Department of Computer Science, University of California, Irvine |
| Pseudocode | Yes | Algorithm 1 Anytime Double Oracle (ADO); Algorithm 2 Anytime PSRO; Algorithm 3 Self-Play PSRO; Algorithm 4 Policy Space Response Oracle (PSRO); Algorithm 5 Exp3; Algorithm 6 Multiplicative Weights Update |
| Open Source Code | Yes | Code for deep RL experiments is available at https://github.com/indylab/sp-psro under the MIT license. |
| Open Datasets | Yes | The experiments used game implementations and tools from the OpenSpiel library (Lanctot et al., 2019). |
| Dataset Splits | No | The paper does not specify traditional training/validation/test dataset splits. It describes iterative training of agents in game environments, where data is generated dynamically. |
| Hardware Specification | Yes | Experiments were run on a local machine with 128 logical CPU cores, 4 Nvidia RTX 3090 GPUs, and 512 GB of RAM. |
| Software Dependencies | Yes | Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults. |
| Experiment Setup | Yes | For all games, Q-learning hyperparameters were OpenSpiel defaults and were kept constant across PSRO, APSRO, and SP-PSRO: step size = 0.1 and epsilon = 0.2. In each APSRO/SP-PSRO iteration, we set γ = min(1, √(k log k / ((e − 1)g))). We target a total of 800,000 episodes and 20,000 Exp3 updates per iteration. In deep RL experiments, we use the Multiplicative Weights Update (MWU) algorithm... with a learning rate of 0.1, updating every 10th RL iteration. Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults. Tables 2 and 3 provide detailed hyperparameters for DDQN. (Minimal sketches of the Exp3 exploration schedule and the MWU update appear below the table.) |
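
The per-iteration exploration rate quoted in the experiment setup follows the standard Exp3 schedule (Auer et al., 2002). The sketch below is a minimal reconstruction under that assumption, with `k` arms (population policies) and a reward-sum bound `g`; it is illustrative, not the authors' exact implementation.

```python
import math
import numpy as np

def exp3_gamma(k: int, g: float) -> float:
    """Exp3 exploration rate: gamma = min(1, sqrt(k*log(k) / ((e - 1)*g)))."""
    return min(1.0, math.sqrt(k * math.log(k) / ((math.e - 1.0) * g)))

def exp3_step(weights: np.ndarray, gamma: float, arm: int, reward: float) -> np.ndarray:
    """One Exp3 weight update after pulling `arm` and observing `reward` in [0, 1]."""
    k = len(weights)
    probs = (1.0 - gamma) * weights / weights.sum() + gamma / k
    estimated_reward = reward / probs[arm]           # importance-weighted reward estimate
    new_weights = weights.copy()
    new_weights[arm] *= math.exp(gamma * estimated_reward / k)
    return new_weights
```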
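The deep RL experiments update the restricted-game meta-distribution with Multiplicative Weights Update at learning rate 0.1, applied every 10th RL iteration. Below is a minimal sketch of one such step; the `payoffs` argument (estimated utilities of each population policy against the opponent's current best response) is an assumption about how inputs are passed, not the paper's interface.

```python
import numpy as np

def mwu_update(dist: np.ndarray, payoffs: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One Multiplicative Weights Update step over the policy population.

    dist:    current meta-distribution over population policies (sums to 1).
    payoffs: estimated payoff of each population policy against the opponent's
             current best-response policy (hypothetical input shape: one value per policy).
    lr:      learning rate; the paper reports 0.1 for the deep RL experiments.
    """
    new_dist = dist * np.exp(lr * payoffs)   # exponential-weights step
    return new_dist / new_dist.sum()         # renormalize to a probability distribution

# Hypothetical usage: a population of three policies.
dist = np.ones(3) / 3
payoffs = np.array([0.2, -0.1, 0.4])
print(mwu_update(dist, payoffs))
```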