A Generalized Training Approach for Multiagent Learning
Authors: Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, Zhe Wang, Guy Lever, Nicolas Heess, Thore Graepel, Remi Munos
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the competitive performance of α-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player Kuhn and Leduc Poker. We then go beyond the reach of prior PSRO applications by considering 3- to 5-player poker games, yielding instances where α-Rank achieves faster convergence than approximate Nash solvers, thus establishing it as a favorable general games solver. We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain. |
| Researcher Affiliation | Industry | All fifteen authors (Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, Zhe Wang, Guy Lever, Nicolas Heess, Thore Graepel, Remi Munos) list email addresses on the google.com domain, and the stated affiliation is DeepMind. |
| Pseudocode | Yes | Algorithm 1: PSRO(M, O); Algorithm 2: Generate Transitive(Actions, Players, mean value = [0.0, 1.0], mean probability = [0.5, 0.5], var = 0.1); Algorithm 3: Generate Cyclic(Actions, Players, var = 0.4); Algorithm 4: General Normal-Form Games Generation(Actions, Players); Algorithm 5: PBR Score(Strategy S, Payoff Tensor, Current Player Id, Joint Strategies, Joint Strategy Probability); Algorithm 6: PBR(Payoff Tensor list LM, Joint Strategies per player PJ, Alpharank Probability per Joint Strategy PA, Current Player). (A schematic Python sketch of the Algorithm 1 loop is given after this table.) |
| Open Source Code | No | The paper mentions using and building upon OpenSpiel (Lanctot et al., 2019) with a link to its GitHub repository, and states 'We implemented a version of α-Rank (building on the OpenSpiel implementation https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/egt/alpharank.py)'. However, it does not explicitly state that the code developed for the paper's methodology is open source, nor does it provide a specific link to it. (An independent α-Rank sketch is given after this table.) |
| Open Datasets | Yes | We conduct evaluations on games of increasing complexity, extending beyond prior PSRO applications that have focused on two-player zero-sum games. For experimental procedures, see Appendix C. Meta-solver comparisons: We consider next the standard benchmarks of Kuhn and Leduc poker (Kuhn, 1950; Southey et al., 2005; Lanctot et al., 2019). We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain (Liu et al., 2019). |
| Dataset Splits | No | The paper describes training processes for agents within game environments and mentions 'train', 'validation', and 'test' as categories of evaluation, but it does not provide specific percentages, sample counts, or methodologies for splitting a dataset into training, validation, and test sets. The experiments are conducted within game simulations rather than on fixed datasets with such splits. |
| Hardware Specification | No | Despite this, we report averages over 2 runs per PSRO M, primarily to capture stochasticity due to differences in machine-specific rounding errors that occur due to the distributed computational platforms we run these experiments on. No specific hardware details (e.g., GPU/CPU models) are mentioned. |
| Software Dependencies | No | The paper mentions using OpenSpiel and MuJoCo for its experiments, but it does not provide specific version numbers for these or any other software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | For experiments involving the projected replicator dynamics (PRD), we used uniformly-initialized meta-distributions, running PRD for 5e4 iterations, using a step-size of dt = 1e-3, and exploration parameter γ = 1e-10. 100 game simulations were used to estimate the payoff matrix for each possible strategy pair. In MuJoCo, a collection of 32 reinforcement learners were trained, 100 simulations were used per entry for the evaluation matrix, and instead of adding one policy per PSRO iteration per player we add three (which corresponds to the 10% best RL agents). Each oracle step in PSRO is composed of 1 billion learning steps of the agents, and we use a 50% probability of training using self-play. (A PRD sketch using these hyperparameters is given after this table.) |
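
Since the paper's own implementation is not released, the sketches below are reconstructions by this review rather than the authors' code. First, a minimal Python sketch of the PSRO(M, O) loop listed as Algorithm 1; the helper names (`estimate_payoff_tensor`, `meta_solver`, `oracle`) are placeholders of ours, and any meta-solver (α-Rank, Nash, PRD) or best-response oracle can be plugged in.

```python
def psro(meta_solver, oracle, initial_policies, estimate_payoff_tensor, iterations):
    """Schematic PSRO(M, O): meta-solver M over the empirical game, oracle O for new policies."""
    policies = [list(pop) for pop in initial_policies]   # one policy population per player
    for _ in range(iterations):
        payoffs = estimate_payoff_tensor(policies)       # e.g. averaged game simulations
        meta_distribution = meta_solver(payoffs)         # e.g. alpha-Rank, Nash, or PRD
        for player in range(len(policies)):
            # Train an (approximate) best response against the current meta-distribution.
            policies[player].append(oracle(player, policies, meta_distribution))
    return policies, meta_solver(estimate_payoff_tensor(policies))
```

The meta-solver only ever sees the empirical payoff tensor, which is why α-Rank, an (approximate) Nash solver, or PRD can be swapped in without touching the rest of the loop.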
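Second, a sketch of an α-Rank meta-solver: it builds the finite-α, multi-population Markov chain over pure strategy profiles described by Omidshafiei et al. (2019) and returns its stationary distribution. This is an independent illustration, not the OpenSpiel `alpharank.py` linked above; the defaults for the population size `m` and selection intensity `alpha` are assumptions of ours.

```python
import itertools
import numpy as np

def alpharank_distribution(payoff_tensors, alpha=100.0, m=50):
    """Stationary distribution of a finite-alpha, multi-population alpha-Rank chain.

    payoff_tensors: one tensor per player, each of shape (|S_1|, ..., |S_K|).
    Independent sketch of the Omidshafiei et al. (2019) transition model; not the
    paper's or OpenSpiel's implementation.
    """
    num_strats = payoff_tensors[0].shape
    profiles = list(itertools.product(*[range(n) for n in num_strats]))
    index = {p: i for i, p in enumerate(profiles)}
    eta = 1.0 / sum(n - 1 for n in num_strats)   # probability of proposing a given mutation

    C = np.zeros((len(profiles), len(profiles)))
    for s in profiles:
        i = index[s]
        for k, n_k in enumerate(num_strats):
            for sigma in range(n_k):
                if sigma == s[k]:
                    continue
                s_new = s[:k] + (sigma,) + s[k + 1:]
                u = alpha * (payoff_tensors[k][s_new] - payoff_tensors[k][s])
                if abs(u) < 1e-12:
                    rho = 1.0 / m                  # neutral-mutant fixation probability
                elif u < -700.0 / m:
                    rho = np.exp((m - 1) * u)      # asymptotic form, avoids overflow
                else:
                    rho = np.expm1(-u) / np.expm1(-m * u)
                C[i, index[s_new]] = eta * rho
        C[i, i] = 1.0 - C[i].sum()

    # Stationary distribution = left eigenvector of C associated with eigenvalue 1.
    vals, vecs = np.linalg.eig(C.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return profiles, np.abs(pi) / np.abs(pi).sum()
```

For a two-player game, `payoff_tensors` would be `[M1, M2]` with `M1[i, j]` the row player's payoff; the returned distribution over joint strategy profiles is what an α-Rank-based PSRO would consume as its meta-distribution.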
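Finally, a sketch exercising the projected replicator dynamics settings quoted in the Experiment Setup row (uniform initialization, 5e4 iterations, dt = 1e-3, γ = 1e-10), where γ acts as an exploration floor keeping every strategy's probability at least γ/|S|. The projection routine and all function names here are our own assumptions, not the paper's or OpenSpiel's implementation.

```python
import numpy as np

def _project_simplex(v, mass=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = mass} (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - mass))[0][-1]
    theta = (css[rho] - mass) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def _project_explored(v, gamma):
    """Project onto the simplex while keeping every coordinate >= gamma / len(v)."""
    lower = gamma / len(v)
    return lower + _project_simplex(v - lower, mass=1.0 - gamma)

def projected_replicator_dynamics(A, B, iterations=50_000, dt=1e-3, gamma=1e-10):
    """Two-player PRD with the hyperparameters quoted in the Experiment Setup row."""
    x = np.ones(A.shape[0]) / A.shape[0]     # uniformly-initialized meta-distributions
    y = np.ones(A.shape[1]) / A.shape[1]
    for _ in range(iterations):
        u1, u2 = A @ y, B.T @ x              # expected payoff of each pure strategy
        x = _project_explored(x + dt * x * (u1 - x @ u1), gamma)
        y = _project_explored(y + dt * y * (u2 - y @ u2), gamma)
    return x, y

# Example: a coordination game where both players drift towards the payoff-dominant action.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
print(projected_replicator_dynamics(A, A))
```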