Model-Free Opponent Shaping
Authors: Christopher Lu, Timon Willi, Christian A Schroeder De Witt, Jakob Foerster
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiment section, we show that M-FOS can exploit naive learners much better than a set of widely used general-sum learning algorithms (Foerster et al., 2018a; Kim et al., 2021). In the IPD, M-FOS discovers a famous strategy known as ZD extortion (Press & Dyson, 2012) when playing against NL agents. Notably, unlike other algorithms, it does so without access to the opponent's underlying learning algorithm. |
| Researcher Affiliation | Academia | Department of Engineering Sciences, University of Oxford, Oxford, United Kingdom. Correspondence to: Chris Lu <christopher.lu@exeter.ox.ac.uk>, Timon Willi <timon.willi@exeter.ox.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 General M-FOS: 1: Initialize M-FOS parameters θ. 2: while true do 3: Initialize agents' parameters ϕ^i_0, ϕ^{-i}_0. 4: for t = 0 to T do 5: Reset environment 6: Gather trajectories τ_ϕ given ϕ^i_t, ϕ^{-i}_t 7: Update ϕ^{-i}_{t+1} according to respective learning algorithms 8: Update ϕ^i_{t+1} according to meta-policy π_θ 9: end for 10: Update θ 11: end while (a hedged Python sketch of this loop follows the table). |
| Open Source Code | No | The paper cites a third-party PPO implementation but does not provide an explicit statement about, or link to, its own source code for the described methodology. |
| Open Datasets | Yes | The paper describes the environments and their rules, such as the payoff matrices for the Prisoner's Dilemma (Table 1), Iterated Matching Pennies (Table 2), and the Chicken Game (Table 3), which constitute the experimental data. |
| Dataset Splits | No | The paper describes training procedures and evaluations within game environments but does not specify explicit training/validation/test splits (percentages, sample counts, or citations to predefined splits), as would be typical for a fixed dataset. |
| Hardware Specification | No | The paper acknowledges general computing resources, such as 'Oxford's Advanced Research Cluster (ARC)', the 'Cirrus UK National Tier-2 HPC Service', and an 'Oracle for Research Cloud Grant', but does not provide specific hardware details such as GPU or CPU models, processor speeds, or memory amounts. |
| Software Dependencies | No | The paper mentions PPO parameters and refers to a PyTorch implementation in the bibliography, but it does not provide specific version numbers for software libraries or programming languages used (e.g., 'PyTorch 1.9' or 'Python 3.8'). |
| Experiment Setup | Yes | Appendix C. Hyperparameter Details provides detailed experimental setup information, including 'Adam Step Size 0.0002', 'Number of Epochs 4', 'PPO Clipping ϵ 0.2', 'Entropy Coefficient 0.01' for PPO, and network architecture details like 'Number of Actor Hidden Layers 1', 'Size of Actor Hidden Layers [256]'. |
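Because the pseudocode row is the core of the method, a minimal Python sketch of Algorithm 1 (General M-FOS), as quoted in the table, is given below. The helper functions (`init_meta_policy`, `init_agent_params`, `reset_environment`, `rollout`, `opponent_update`, `meta_policy_update`, `update_meta_params`) and the bounded outer loop are hypothetical placeholders for illustration, not the authors' implementation; only the loop structure follows the quoted pseudocode.

```python
# Structural sketch of Algorithm 1 (General M-FOS). All helper functions are
# hypothetical placeholders; only the loop structure mirrors the quoted pseudocode.

def train_mfos(num_meta_episodes: int, T: int):
    theta = init_meta_policy()                       # 1: initialize M-FOS parameters θ
    for _ in range(num_meta_episodes):               # 2: "while true" in the paper, bounded here
        phi_self, phi_opp = init_agent_params()      # 3: initialize agents' parameters ϕ^i_0, ϕ^{-i}_0
        meta_trajectory = []
        for t in range(T):                           # 4: inner episodes of one meta-episode
            env = reset_environment()                # 5: reset environment
            tau = rollout(env, phi_self, phi_opp)    # 6: gather trajectories τ_ϕ given ϕ^i_t, ϕ^{-i}_t
            meta_trajectory.append(tau)
            phi_opp = opponent_update(phi_opp, tau)  # 7: opponent updates via its own learning algorithm
            phi_self = meta_policy_update(theta, phi_self, tau)  # 8: M-FOS agent updated by meta-policy π_θ
        theta = update_meta_params(theta, meta_trajectory)       # 10: update θ (e.g. with PPO, per Appendix C)
    return theta
```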
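The experiment-setup row reports the PPO settings from Appendix C. A minimal configuration sketch of those reported values is below; the dictionary keys are illustrative names, since the paper lists only labels and values.

```python
# PPO hyperparameters as reported in Appendix C of the paper; key names are illustrative.
ppo_hyperparameters = {
    "adam_step_size": 0.0002,          # Adam Step Size
    "num_epochs": 4,                   # Number of Epochs
    "ppo_clip_epsilon": 0.2,           # PPO Clipping ϵ
    "entropy_coefficient": 0.01,       # Entropy Coefficient
    "num_actor_hidden_layers": 1,      # Number of Actor Hidden Layers
    "actor_hidden_layer_sizes": [256], # Size of Actor Hidden Layers
}
```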