Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games

Authors: Stephen Marcus McAleer, JB Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, Roy Fox

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game's deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar's Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games.
Researcher Affiliation | Academia | (1) Department of Computer Science, Carnegie Mellon University; (2) Department of Computer Science, University of California, Irvine
Pseudocode | Yes | Algorithm 1: Anytime Double Oracle (ADO); Algorithm 2: Anytime PSRO; Algorithm 3: Self-Play PSRO; Algorithm 4: Policy Space Response Oracles (PSRO); Algorithm 5: Exp3; Algorithm 6: Multiplicative Weights Update
Open Source Code | Yes | Code for deep RL experiments is available at https://github.com/indylab/sp-psro under the MIT license.
Open Datasets | Yes | The experiments used game implementations and tools from the OpenSpiel library (Lanctot et al., 2019).
Dataset Splits | No | The paper does not specify traditional training/validation/test dataset splits. It describes iterative training of agents in game environments, where data is generated dynamically.
Hardware Specification | Yes | Experiments were run on a local machine with 128 logical CPU cores, 4 Nvidia RTX 3090 GPUs, and 512 GB of RAM.
Software Dependencies | Yes | Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults.
Experiment Setup | Yes | For all games, Q-learning hyperparameters were OpenSpiel defaults and were kept constant across PSRO, APSRO, and SP-PSRO: step size = 0.1 and epsilon = 0.2. In each APSRO/SP-PSRO iteration, we set γ = min(1, √(k log k / ((e − 1) g))). We target a total of 800,000 episodes and 20,000 Exp3 updates per iteration. In deep RL experiments, we use the Multiplicative Weights Update (MWU) algorithm... with a learning rate of 0.1, updating every 10th RL iteration. Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults. Tables 2 and 3 provide detailed hyperparameters for DDQN.
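
To make the method summary in the Research Type row concrete, below is a minimal sketch of a PSRO-style population-growth loop with the SP-PSRO addition of a self-play-trained stochastic policy each iteration. The helpers `uniform_random_policy`, `restricted_nash`, `best_response`, and `train_self_play_policy` are hypothetical placeholders, and the no-regret restricted-distribution machinery of ADO/APSRO is omitted; this is not the authors' implementation.

```python
# Schematic SP-PSRO-style loop (sketch only; helper functions are placeholders).

def sp_psro(game, num_iterations):
    # Start each player's population with a uniform-random policy.
    populations = {0: [uniform_random_policy(game, player=0)],
                   1: [uniform_random_policy(game, player=1)]}
    for _ in range(num_iterations):
        # Meta-strategy: (approximate) Nash equilibrium of the restricted game
        # spanned by the current populations.
        meta = restricted_nash(game, populations)
        new_policies = {}
        for player in (0, 1):
            opponent = 1 - player
            opponent_mix = (populations[opponent], meta[opponent])
            # Classic PSRO step: approximate best response to the opponent's
            # restricted-game meta-strategy.
            br = best_response(game, player, opponent_mix)
            # SP-PSRO step: an approximately optimal *stochastic* policy,
            # e.g. trained with self-play against the evolving opponent mix.
            sp = train_self_play_policy(game, player, opponent_mix)
            new_policies[player] = [br, sp]
        for player in (0, 1):
            populations[player].extend(new_policies[player])
    return populations, restricted_nash(game, populations)
```

The intuition, per the abstract, is that one added stochastic policy can stand in for a mixture that would otherwise require many deterministic best responses, which is why SP-PSRO can converge in far fewer iterations.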
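
The benchmark games in the Open Datasets row ship with OpenSpiel. Assuming OpenSpiel's standard Python bindings, loading them and computing the exploitability metric used throughout the paper looks roughly like this:

```python
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

# Load the benchmark games by their OpenSpiel registry names.
leduc = pyspiel.load_game("leduc_poker")
liars_dice = pyspiel.load_game("liars_dice")

# Exploitability of the uniform-random policy is a convenient baseline;
# PSRO variants are judged by how quickly this quantity falls for the
# restricted-game meta-strategy.
uniform = policy.UniformRandomPolicy(leduc)
print(exploitability.exploitability(leduc, uniform))
```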
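
For the tabular hyperparameters in the Experiment Setup row, the Exp3 exploration rate and an MWU step can be written out directly. The sketch below assumes the standard Exp3 convention that k is the number of arms (population policies) and g is an upper bound on cumulative payoff, and uses the exponential-weights form of MWU; these interpretations of k and g are inferred rather than quoted from the paper.

```python
import math

def exp3_gamma(k, g):
    # gamma = min(1, sqrt(k log k / ((e - 1) g))), as quoted above.
    # k: number of arms (population policies); g: assumed payoff upper bound.
    return min(1.0, math.sqrt(k * math.log(k) / ((math.e - 1) * g)))

def mwu_update(weights, payoffs, lr=0.1):
    # One exponential-weights (MWU) step with the reported learning rate 0.1;
    # returns a normalized distribution over population policies.
    new = [w * math.exp(lr * p) for w, p in zip(weights, payoffs)]
    total = sum(new)
    return [w / total for w in new]

# Example: exploration rate for a 4-policy population with payoff bound 100,
# followed by one MWU step from the uniform distribution.
print(exp3_gamma(4, 100.0))
print(mwu_update([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.5, -0.5]))
```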