A Generalized Training Approach for Multiagent Learning

Authors: Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, Zhe Wang, Guy Lever, Nicolas Heess, Thore Graepel, Remi Munos

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the competitive performance of α-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player Kuhn and Leduc Poker. We then go beyond the reach of prior PSRO applications by considering 3- to 5-player poker games, yielding instances where α-Rank achieves faster convergence than approximate Nash solvers, thus establishing it as a favorable general games solver. We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain.
Researcher Affiliation | Industry | Paul Muller pmuller@..., Shayegan Omidshafiei somidshafiei@..., Mark Rowland markrowland@..., Karl Tuyls karltuyls@..., Julien Perolat perolat@..., Siqi Liu liusiqi@..., Daniel Hennes hennes@..., Luke Marris marris@..., Marc Lanctot lanctot@..., Edward Hughes edwardhughes@..., Zhe Wang zhewang@..., Guy Lever guylever@..., Nicolas Heess heess@..., Thore Graepel thore@..., Remi Munos munos@... (...google.com). DeepMind.
Pseudocode | Yes | Algorithm 1 PSRO(M, O); Algorithm 2 Generate Transitive(Actions, Players, mean value = [0.0, 1.0], mean probability = [0.5, 0.5], var = 0.1); Algorithm 3 Generate Cyclic(Actions, Players, var = 0.4); Algorithm 4 General Normal Form Games Generation(Actions, Players); Algorithm 5 PBR Score(Strategy S, Payoff Tensor, Current Player Id, Joint Strategies, Joint Strategy Probability); Algorithm 6 PBR(Payoff Tensor list LM, Joint Strategies per player PJ, Alpharank Probability per Joint Strategy PA, Current Player). A hedged sketch of the PSRO(M, O) loop is given after the table.
Open Source Code | No | The paper mentions using and building upon OpenSpiel (Lanctot et al., 2019) with a link to its GitHub repository, and states 'We implemented a version of α-Rank (building on the OpenSpiel implementation https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/egt/alpharank.py)'. However, it does not explicitly state that the code developed for the paper's methodology is open-source or provide a specific link to it. An illustrative call to the OpenSpiel α-Rank module is sketched after the table.
Open Datasets | Yes | We conduct evaluations on games of increasing complexity, extending beyond prior PSRO applications that have focused on two-player zero-sum games. For experimental procedures, see Appendix C. Meta-solver comparisons: we consider next the standard benchmarks of Kuhn and Leduc poker (Kuhn, 1950; Southey et al., 2005; Lanctot et al., 2019). We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain (Liu et al., 2019). Both poker benchmarks ship with OpenSpiel; a loading sketch is given after the table.
Dataset Splits | No | The paper describes training processes for agents within game environments and mentions 'train', 'validation', and 'test' as categories of evaluation, but it does not provide specific percentages, sample counts, or methodologies for splitting a dataset into training, validation, and test sets. The experiments are conducted within game simulations rather than on fixed datasets with such splits.
Hardware Specification | No | Despite this, we report averages over 2 runs per PSRO M, primarily to capture stochasticity due to differences in machine-specific rounding errors that occur due to the distributed computational platforms we run these experiments on. No specific hardware details (e.g., GPU/CPU models) are mentioned.
Software Dependencies | No | The paper mentions using OpenSpiel and MuJoCo for its experiments, but it does not provide specific version numbers for these or any other software dependencies, such as programming languages or libraries.
Experiment Setup | Yes | For experiments involving the projected replicator dynamics (PRD), we used uniformly-initialized meta-distributions, running PRD for 5e4 iterations with a step size of dt = 1e-3 and an exploration parameter γ = 1e-10. 100 game simulations were used to estimate the payoff matrix for each possible strategy pair. In MuJoCo, a collection of 32 reinforcement learners was trained, 100 simulations were used per entry of the evaluation matrix, and instead of adding one policy per PSRO iteration per player we add three (corresponding to the 10% best RL agents). Each oracle step in PSRO is composed of 1 billion learning steps of the agents, and we use a 50% probability of training using self-play. A hedged PRD sketch using these hyperparameters is given after the table.
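
The pseudocode row lists Algorithm 1, PSRO(M, O). The following Python sketch is a rough illustration only of the generic PSRO loop (per-player populations, meta-game estimation, meta-solver M, best-response oracle O); the helper names `initial_policy`, `estimate_payoff_tensors`, `meta_solver`, and `best_response_oracle` are hypothetical placeholders, not the authors' code.

```python
# Illustrative PSRO(M, O) loop; initial_policy, estimate_payoff_tensors,
# meta_solver and best_response_oracle are hypothetical placeholders.
def psro(game, meta_solver, best_response_oracle, num_players, iterations=10):
    # Start each player's population with a single initial policy.
    populations = [[game.initial_policy(p)] for p in range(num_players)]
    meta_distribution = None

    for _ in range(iterations):
        # 1. Estimate the meta-game: expected payoffs for every joint choice
        #    of population members (e.g. via Monte Carlo simulation).
        payoff_tensors = game.estimate_payoff_tensors(populations)

        # 2. Meta-solver M: alpha-Rank, an (approximate) Nash solver, or PRD;
        #    returns a distribution over joint strategies of the populations.
        meta_distribution = meta_solver(payoff_tensors)

        # 3. Oracle O: each player adds a response trained against opponents
        #    sampled from the meta-distribution.
        for p in range(num_players):
            populations[p].append(
                best_response_oracle(game, p, populations, meta_distribution))

    return populations, meta_distribution
```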
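
The open-source-code row quotes the paper's reliance on the OpenSpiel α-Rank implementation. Below is a minimal sketch of calling that module on a toy two-player meta-game; the exact keyword names and five-tuple return value of `alpharank.compute` are assumptions based on recent OpenSpiel releases and may differ.

```python
# Toy call of OpenSpiel's alpha-Rank on a 2-player meta-game; the signature
# of alpharank.compute (keywords and 5-tuple return) is assumed from recent
# OpenSpiel versions and may differ between releases.
import numpy as np
from open_spiel.python.egt import alpharank

# One payoff table per player; here a zero-sum rock-paper-scissors meta-game.
payoff_p1 = np.array([[ 0.0, -1.0,  1.0],
                      [ 1.0,  0.0, -1.0],
                      [-1.0,  1.0,  0.0]])
payoff_tables = [payoff_p1, -payoff_p1]

# pi is the stationary distribution over joint strategy profiles; PSRO can
# use it directly as its meta-distribution.
_, _, pi, _, _ = alpharank.compute(payoff_tables, alpha=1e2)
print(pi)
```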
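
The open-datasets row names Kuhn and Leduc poker, both of which ship with OpenSpiel. A minimal loading sketch, assuming the `pyspiel` game-string parameter syntax of recent OpenSpiel releases:

```python
# Loading the poker benchmarks through OpenSpiel; the game-parameter string
# syntax is assumed to match recent OpenSpiel releases.
import pyspiel

kuhn_2p = pyspiel.load_game("kuhn_poker")
kuhn_3p = pyspiel.load_game("kuhn_poker(players=3)")   # 3- to 5-player variants
leduc_2p = pyspiel.load_game("leduc_poker")

print(kuhn_3p.num_players(), leduc_2p.num_distinct_actions())
```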
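
The experiment-setup row reports the PRD hyperparameters (5e4 iterations, dt = 1e-3, γ = 1e-10, uniform initialization). The sketch below plugs those values into a simple two-player projected replicator dynamics loop; it is not the authors' implementation, and the clip-and-renormalize step only approximates the projection onto the γ-exploratory simplex.

```python
# Hedged sketch of projected replicator dynamics (PRD) with the reported
# hyperparameters (5e4 iterations, dt = 1e-3, gamma = 1e-10, uniform init).
# Not the authors' code: the projection onto the gamma-exploratory simplex is
# approximated by clipping to the lower bound and renormalizing.
import numpy as np


def projected_replicator_dynamics(payoffs_p1, payoffs_p2,
                                  iterations=50_000, dt=1e-3, gamma=1e-10):
    """Two-player PRD on bimatrix payoffs payoffs_p1[i, j], payoffs_p2[i, j]."""
    n1, n2 = payoffs_p1.shape
    x = np.full(n1, 1.0 / n1)   # uniformly-initialized meta-distributions
    y = np.full(n2, 1.0 / n2)

    def project(v, lower):
        v = np.maximum(v, lower)   # keep at least gamma/|S| mass per strategy
        return v / v.sum()         # renormalize back onto the simplex

    for _ in range(iterations):
        u1 = payoffs_p1 @ y        # expected payoff of each row strategy
        u2 = payoffs_p2.T @ x      # expected payoff of each column strategy
        x = project(x + dt * x * (u1 - x @ u1), gamma / n1)
        y = project(y + dt * y * (u2 - y @ u2), gamma / n2)
    return x, y
```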