Open-ended learning in symmetric zero-sum games
Authors: David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, Thore Graepel
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply PSRO_rN to two highly nontransitive resource allocation games and find that PSRO_rN consistently outperforms the existing alternatives. We investigated the performance of the proposed algorithms in two highly nontransitive resource allocation games. ... Figure 4. Performance of PSRO_rN relative to self-play, PSRO_U and PSRO_N on Blotto (left) and Differentiable Lotto (right). In all cases, the relative performance of PSRO_rN is positive, and therefore outperforms the other algorithms. |
| Researcher Affiliation | Industry | 1DeepMind. Correspondence to: <dbalduzzi@google.com>. |
| Pseudocode | Yes | Algorithm 1 Optimization (against a fixed opponent)... Algorithm 2 Self-play... Algorithm 3 Response to Nash (PSRO_N)... Algorithm 4 Response to rectified Nash (PSRO_rN). |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide any links to a code repository for the described methodology. |
| Open Datasets | Yes | We investigated the performance of the proposed algorithms in two highly nontransitive resource allocation games. Colonel Blotto (Borel, 1921; Tukey, 1949; Roberson, 2006) ... In Blotto, we investigate performance for a = 3 areas and c = 10 coins over k = 1000 games. Differentiable Lotto is inspired by continuous Lotto (Hart, 2008). ... Differentiable Lotto experiments are from k = 500 games with c = 9 customers chosen uniformly at random in the square [−1, 1]². |
| Dataset Splits | No | The paper describes experiments within game simulations (Colonel Blotto and Differentiable Lotto) and specifies the parameters for these simulations, but it does not mention traditional training, validation, or test dataset splits for a fixed dataset, as the data is generated through game play. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or specific cloud instance types used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'maximum a posteriori policy optimization (MPO) (Abdolmaleki et al., 2018)' and 'gradient ascent' as oracles, but does not provide specific software names with version numbers for any libraries or frameworks used. |
| Experiment Setup | Yes | In Blotto, we investigate performance for a = 3 areas and c = 10 coins over k = 1000 games. An agent outputs a vector in R^3, which is passed to a softmax and discretized to obtain three integers summing to 10. Differentiable Lotto experiments are from k = 500 games with c = 9 customers chosen uniformly at random in the square [−1, 1]². ... We impose that agents have width equal to one. |
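The Blotto agent parameterization quoted above (a vector in R^3 passed to a softmax and discretized into three non-negative integers summing to c = 10 coins) can be sketched as follows. This is a minimal illustration, not the paper's code: the paper does not specify the rounding rule, so largest-remainder rounding is assumed here, and the function name `blotto_allocation` is invented for this example.

```python
import numpy as np

def blotto_allocation(logits, coins=10):
    """Map a raw agent output in R^3 to an integer allocation of
    `coins` coins over three areas: softmax, scale by `coins`,
    then discretize (largest-remainder rounding, an assumption)."""
    # Numerically stable softmax over the three areas.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Scale to the coin budget and take integer floors.
    raw = probs * coins
    alloc = np.floor(raw).astype(int)
    # Hand out the remaining coins to the largest fractional parts,
    # so the allocation sums exactly to `coins`.
    remainder = coins - alloc.sum()
    order = np.argsort(raw - alloc)[::-1]
    alloc[order[:remainder]] += 1
    return alloc

# Example usage: three integers summing to 10.
print(blotto_allocation(np.array([1.0, 0.5, -0.3])))
```

Any such discretization keeps the agent's search space continuous (the softmax logits) while the game itself is played over integer allocations.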