Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games

Authors: Stephen Marcus McAleer, JB Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, Roy Fox

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game's deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar's Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games.
Researcher Affiliation | Academia | (1) Department of Computer Science, Carnegie Mellon University; (2) Department of Computer Science, University of California, Irvine
Pseudocode | Yes | Algorithm 1: Anytime Double Oracle (ADO); Algorithm 2: Anytime PSRO; Algorithm 3: Self-Play PSRO; Algorithm 4: Policy Space Response Oracles (PSRO); Algorithm 5: Exp3; Algorithm 6: Multiplicative Weights Update
Open Source Code | Yes | Code for deep RL experiments is available at https://github.com/indylab/sp-psro under the MIT license.
Open Datasets | Yes | The experiments used game implementations and tools from the OpenSpiel library (Lanctot et al., 2019).
Dataset Splits | No | The paper does not specify traditional training/validation/test dataset splits. It describes iterative training of agents in game environments, where data is generated dynamically.
Hardware Specification | Yes | Experiments were run on a local machine with 128 logical CPU cores, 4 Nvidia RTX 3090 GPUs, and 512 GB of RAM.
Software Dependencies | Yes | Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults.
Experiment Setup | Yes | For all games, Q-learning hyperparameters were OpenSpiel defaults and were kept constant across PSRO, APSRO, and SP-PSRO: step size = 0.1 and epsilon = 0.2. In each APSRO/SP-PSRO iteration, we set γ = min(1, √(k log k / ((e − 1) g))). We target a total of 800,000 episodes and 20,000 Exp3 updates per iteration. In deep RL experiments, we use the Multiplicative Weights Update (MWU) algorithm... with a learning rate of 0.1, updating every 10th RL iteration. Our deep RL code was built on top of the RLlib framework (Liang et al., 2018b), and any hyperparameters not specified are the version 1.0.1 defaults. Tables 2 and 3 provide detailed hyperparameters for DDQN.
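
To make the method summary in the Research Type row concrete, below is a minimal sketch of a PSRO-style population-growth loop with the SP-PSRO addition of a self-play-trained stochastic policy each iteration. The helpers `uniform_random_policy`, `restricted_nash`, `best_response`, and `train_self_play_policy` are hypothetical placeholders, and the no-regret restricted-distribution machinery of ADO/APSRO is omitted; this is not the authors' implementation.

```python
# Schematic SP-PSRO-style loop (sketch only; helper functions are placeholders).

def sp_psro(game, num_iterations):
    # Start each player's population with a uniform-random policy.
    populations = {0: [uniform_random_policy(game, player=0)],
                   1: [uniform_random_policy(game, player=1)]}
    for _ in range(num_iterations):
        # Meta-strategy: (approximate) Nash equilibrium of the restricted game
        # spanned by the current populations.
        meta = restricted_nash(game, populations)
        new_policies = {}
        for player in (0, 1):
            opponent = 1 - player
            opponent_mix = (populations[opponent], meta[opponent])
            # Classic PSRO step: approximate best response to the opponent's
            # restricted-game meta-strategy.
            br = best_response(game, player, opponent_mix)
            # SP-PSRO step: an approximately optimal *stochastic* policy,
            # e.g. trained with self-play against the evolving opponent mix.
            sp = train_self_play_policy(game, player, opponent_mix)
            new_policies[player] = [br, sp]
        for player in (0, 1):
            populations[player].extend(new_policies[player])
    return populations, restricted_nash(game, populations)
```

The intuition, per the abstract, is that one added stochastic policy can stand in for a mixture that would otherwise require many deterministic best responses, which is why SP-PSRO can converge in far fewer iterations.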
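
The benchmark games in the Open Datasets row ship with OpenSpiel. Assuming OpenSpiel's standard Python bindings, loading them and computing the exploitability metric used throughout the paper looks roughly like this:

```python
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

# Load the benchmark games by their OpenSpiel registry names.
leduc = pyspiel.load_game("leduc_poker")
liars_dice = pyspiel.load_game("liars_dice")

# Exploitability of the uniform-random policy is a convenient baseline;
# PSRO variants are judged by how quickly this quantity falls for the
# restricted-game meta-strategy.
uniform = policy.UniformRandomPolicy(leduc)
print(exploitability.exploitability(leduc, uniform))
```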
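
For the tabular hyperparameters in the Experiment Setup row, the Exp3 exploration rate and an MWU step can be written out directly. The sketch below assumes the standard Exp3 convention that k is the number of arms (population policies) and g is an upper bound on cumulative payoff, and uses the exponential-weights form of MWU; these interpretations of k and g are inferred rather than quoted from the paper.

```python
import math

def exp3_gamma(k, g):
    # gamma = min(1, sqrt(k log k / ((e - 1) g))), as quoted above.
    # k: number of arms (population policies); g: assumed payoff upper bound.
    return min(1.0, math.sqrt(k * math.log(k) / ((math.e - 1) * g)))

def mwu_update(weights, payoffs, lr=0.1):
    # One exponential-weights (MWU) step with the reported learning rate 0.1;
    # returns a normalized distribution over population policies.
    new = [w * math.exp(lr * p) for w, p in zip(weights, payoffs)]
    total = sum(new)
    return [w / total for w in new]

# Example: exploration rate for a 4-policy population with payoff bound 100,
# followed by one MWU step from the uniform distribution.
print(exp3_gamma(4, 100.0))
print(mwu_update([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.5, -0.5]))
```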