Proximal Learning With Opponent-Learning Awareness
Authors: Stephen Zhao, Chris Lu, Roger B. Grosse, Jakob Foerster
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA. |
| Researcher Affiliation | Collaboration | Stephen Zhao, University of Toronto and Vector Institute, stephen.zhao@mail.utoronto.ca; Chris Lu, FLAIR, University of Oxford, christopher.lu@exeter.ox.ac.uk; Roger Grosse, University of Toronto and Vector Institute, rgrosse@cs.toronto.edu; Jakob Foerster, FLAIR, University of Oxford, jakob.foerster@eng.ox.ac.uk |
| Pseudocode | Yes | Algorithm 1: Outer POLA, 2-agent formulation (update for agent 1); Algorithm 2: POLA-DiCE, 2-agent formulation (update for agent 1) |
| Open Source Code | Yes | For reproducibility, our code is available at: https://github.com/Silent-Zebra/POLA. |
| Open Datasets | No | The paper describes the experimental environments (IPD, coin game) and how agents interact within them, but does not provide concrete access information (link, DOI, formal citation) for specific datasets used for training, nor does it specify that these environments come with pre-defined public datasets. The experiments are conducted within simulated environments rather than on external, pre-existing datasets. |
| Dataset Splits | No | The paper describes running experiments and evaluating performance but does not explicitly state the use of specific training, validation, or test splits with percentages or counts for data. It refers to 'training' and 'evaluation' but not explicit data partitioning for these phases. |
| Hardware Specification | Yes | Appendix B.5. states: 'All experiments were run on an internal cluster of NVIDIA GeForce RTX 2080 Ti GPUs and Intel Xeon Gold 6248 CPUs, using up to 10 GPUs and 20 CPU cores per experiment. Training times varied widely based on algorithm and environment, from a few minutes to tens of hours for a single seed.' |
| Software Dependencies | No | The paper mentions using PyTorch in Appendix B.4. ('We adapted existing PyTorch implementations of the environments mentioned') but does not provide specific version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Appendix B.1.4 further discusses hyperparameter settings. For more details on the problem setting, policy parameterization, and hyperparameters, see Appendix B.2. Appendix B.3 provides more detail on the Coin Game setup. The paper specifies settings such as the cooperation factor f_R, learning rate η, penalty strength β_out, and the number of outer steps M and inner steps K (an illustrative sketch of this update structure follows the table). |
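
To make the Pseudocode and Experiment Setup rows concrete, the sketch below illustrates the outer/inner proximal structure of the Outer POLA update (Algorithm 1) using the hyperparameters named above (M outer steps, K inner steps, learning rates, penalty strengths). This is a minimal illustrative sketch, not the authors' implementation: it assumes generic differentiable losses `L1` and `L2` that each agent minimizes, uses a squared parameter-space distance in place of the paper's policy-space proximal penalty, and omits the DiCE rollout machinery of Algorithm 2. The function name `pola_outer_update` and all default values are hypothetical.

```python
import torch

def pola_outer_update(th1, th2, L1, L2,
                      M=10, K=5,                   # outer steps M, inner steps K
                      eta_out=1.0, eta_in=1.0,     # outer / inner learning rates
                      beta_out=0.5, beta_in=0.5):  # proximal penalty strengths
    """One (sketched) POLA update for agent 1; returns its new parameters."""
    th1_old = th1.detach().clone()   # anchor for agent 1's outer proximal penalty
    th2_old = th2.detach().clone()   # anchor for the opponent's inner proximal penalty
    th1_cur = th1_old.clone().requires_grad_(True)

    for _ in range(M):  # outer proximal iterations for agent 1
        # Inner loop: approximate the opponent's proximal best response to
        # th1_cur, keeping the graph so agent 1 can differentiate through it.
        th2_inner = th2_old.clone().requires_grad_(True)
        for _ in range(K):
            inner_obj = (L2(th1_cur, th2_inner)
                         + beta_in * torch.sum((th2_inner - th2_old) ** 2))
            grad2 = torch.autograd.grad(inner_obj, th2_inner, create_graph=True)[0]
            th2_inner = th2_inner - eta_in * grad2

        # Outer step: descend agent 1's loss against the shaped opponent,
        # plus a proximal penalty keeping th1 near its pre-update parameters.
        outer_obj = (L1(th1_cur, th2_inner)
                     + beta_out * torch.sum((th1_cur - th1_old) ** 2))
        grad1 = torch.autograd.grad(outer_obj, th1_cur)[0]
        th1_cur = (th1_cur - eta_out * grad1).detach().requires_grad_(True)

    return th1_cur.detach()

# Toy usage with quadratic placeholder losses standing in for negative returns:
th1, th2 = torch.randn(5), torch.randn(5)
L1 = lambda a, b: torch.sum((a - b) ** 2)
L2 = lambda a, b: torch.sum((a + b) ** 2)
new_th1 = pola_outer_update(th1, th2, L1, L2)
```

Because the inner gradients are taken with `create_graph=True`, the opponent's simulated update remains a differentiable function of agent 1's parameters, so the outer gradient carries the opponent-shaping term that distinguishes LOLA-style updates from naive learning; the proximal penalties are what POLA adds on top of that structure.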