Proximal Learning With Opponent-Learning Awareness

Authors: Stephen Zhao, Chris Lu, Roger B. Grosse, Jakob Foerster

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA.
Researcher Affiliation | Collaboration | Stephen Zhao (University of Toronto and Vector Institute; stephen.zhao@mail.utoronto.ca); Chris Lu (FLAIR, University of Oxford; christopher.lu@exeter.ox.ac.uk); Roger Grosse (University of Toronto and Vector Institute; rgrosse@cs.toronto.edu); Jakob Foerster (FLAIR, University of Oxford; jakob.foerster@eng.ox.ac.uk)
Pseudocode | Yes | Algorithm 1 (Outer POLA, 2-agent formulation: update for agent 1); Algorithm 2 (POLA-DiCE, 2-agent formulation: update for agent 1)
Open Source Code | Yes | For reproducibility, our code is available at: https://github.com/Silent-Zebra/POLA.
Open Datasets | No | The paper describes the experimental environments (the iterated prisoner's dilemma and the Coin Game) and how agents interact within them, but does not provide concrete access information (link, DOI, formal citation) for datasets used in training, nor does it state that these environments come with pre-defined public datasets. The experiments are conducted in simulated environments rather than on external, pre-existing datasets.
Dataset Splits | No | The paper describes running experiments and evaluating performance but does not state explicit training/validation/test splits (as percentages or counts). It refers to 'training' and 'evaluation' but not to any explicit data partitioning for these phases.
Hardware Specification | Yes | Appendix B.5 states: 'All experiments were run on an internal cluster of NVIDIA GeForce RTX 2080 Ti GPUs and Intel Xeon Gold 6248 CPUs, using up to 10 GPUs and 20 CPU cores per experiment. Training times varied widely based on algorithm and environment, from a few minutes to tens of hours for a single seed.'
Software Dependencies | No | The paper mentions using PyTorch in Appendix B.4 ('We adapted existing PyTorch implementations of the environments mentioned') but does not give specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Appendix B.1.4 discusses hyperparameter settings in more detail. For more on the problem setting, policy parameterization, and hyperparameters, see Appendix B.2; Appendix B.3 provides further detail on the Coin Game setup. The paper names specific settings such as the cooperation factor f_R, the learning rate η, the penalty strength β_out, and the numbers of outer steps M and inner steps K; a hedged sketch of how these enter the outer proximal loop follows this table.
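
The pseudocode and experiment-setup rows above name Algorithm 1 (Outer POLA) and the hyperparameters η (learning rate), β_out (penalty strength), M (outer steps), and K (inner steps). Below is a minimal sketch of how such an outer proximal loop could be structured, assuming a PyTorch policy module and a caller-supplied policy_objective function; the L2 parameter-space penalty stands in for the paper's policy-space proximal term, and none of the names are taken from the authors' implementation (see the linked repository for that).

    # Hypothetical sketch in the spirit of Algorithm 1 ("Outer POLA");
    # not the authors' implementation. `policy_objective`, `eta`, `beta_out`,
    # `M`, and `K` are illustrative placeholders.
    import torch


    def outer_pola_update(policy, policy_objective, eta=0.01, beta_out=1.0, M=5, K=10):
        """Run M approximate proximal steps; each step takes K gradient updates on
        the (opponent-shaped) objective minus a proximal penalty that keeps the new
        policy close to the policy held at the start of the step."""
        for _ in range(M):
            # Freeze the current parameters as the proximal anchor for this outer step.
            anchor = [p.detach().clone() for p in policy.parameters()]
            optimizer = torch.optim.SGD(policy.parameters(), lr=eta)
            for _ in range(K):
                objective = policy_objective(policy)  # expected return, e.g. from rollouts
                # Simplified L2 penalty in parameter space; the paper's ideal update
                # instead penalizes divergence in policy space.
                penalty = sum(((p - a) ** 2).sum()
                              for p, a in zip(policy.parameters(), anchor))
                loss = -(objective - beta_out * penalty)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return policy

In the paper's practical variant (Algorithm 2, POLA-DiCE), the objectives are estimated with DiCE and an analogous inner proximal step is applied to the opponent model; the proximal anchoring is what makes the update less sensitive to the choice of policy parameterization than LOLA.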