Model-Free Opponent Shaping

Authors: Christopher Lu, Timon Willi, Christian A Schroeder De Witt, Jakob Foerster

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiment section, we show that M-FOS can exploit naive learners much better than a set of widely used general-sum learning algorithms (Foerster et al., 2018a; Kim et al., 2021). In the IPD, M-FOS discovers a famous strategy known as ZD extortion (Press & Dyson, 2012) when playing against NL agents. Notably, unlike other algorithms, it does so without access to the opponent's underlying learning algorithm.
Researcher Affiliation | Academia | Department of Engineering Sciences, University of Oxford, Oxford, United Kingdom. Correspondence to: Chris Lu <christopher.lu@exeter.ox.ac.uk>, Timon Willi <timon.willi@exeter.ox.ac.uk>.
Pseudocode | Yes | Algorithm 1 (General M-FOS): 1: Initialize M-FOS parameters θ. 2: while true do 3: Initialize agents' parameters ϕ^i_0, ϕ^{-i}_0. 4: for t = 0 to T do 5: Reset environment 6: Gather trajectories τ_ϕ given ϕ^i_t, ϕ^{-i}_t 7: Update ϕ^{-i}_{t+1} according to the respective learning algorithms 8: Update ϕ^i_{t+1} according to meta-policy π_θ 9: end for 10: Update θ 11: end while. (A minimal runnable sketch of this loop follows the table.)
Open Source Code | No | The paper mentions a third-party PPO implementation but does not provide an explicit statement or link to its own source code for the described methodology.
Open Datasets | Yes | The paper describes the environments and their rules, such as the payoff matrices for the Prisoner's Dilemma (Table 1), Iterated Matching Pennies (Table 2), and the Chicken Game (Table 3), which constitute the experimental data. (An illustrative encoding of these matrix games follows the table.)
Dataset Splits | No | The paper describes training procedures and evaluations within game environments, but it does not specify explicit training/validation/test splits with percentages, counts, or citations to predefined splits, as would be typical for a fixed dataset.
Hardware Specification | No | The paper acknowledges general computing resources such as Oxford's Advanced Research Cluster (ARC), the Cirrus UK National Tier-2 HPC Service, and an Oracle for Research Cloud Grant, but it does not provide specific hardware details such as GPU or CPU models, processor speeds, or memory amounts.
Software Dependencies | No | The paper lists PPO parameters and cites a PyTorch implementation in the bibliography, but it does not provide version numbers for the software libraries or programming languages used (e.g., 'PyTorch 1.9' or 'Python 3.8').
Experiment Setup | Yes | Appendix C (Hyperparameter Details) provides detailed experimental setup information, including 'Adam Step Size 0.0002', 'Number of Epochs 4', 'PPO Clipping ϵ 0.2', and 'Entropy Coefficient 0.01' for PPO, along with network architecture details such as 'Number of Actor Hidden Layers 1' and 'Size of Actor Hidden Layers [256]'. (These values are collected into a configuration sketch after the table.)
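
For concreteness, here is a minimal, runnable sketch of the loop structure quoted in Algorithm 1, assuming toy placeholder functions (gather_trajectories, opponent_update, meta_policy) and hypothetical sizes; the paper's actual implementation optimizes the meta-policy with PPO rather than the crude random-perturbation outer update used here.

```python
# Sketch of Algorithm 1 (General M-FOS) with placeholder inner/outer updates.
import numpy as np

rng = np.random.default_rng(0)

T = 16            # inner-episode length (hypothetical value)
META_ITERS = 50   # number of outer (meta) updates (hypothetical value)
DIM = 5           # size of each agent's policy parameters (hypothetical value)

theta = rng.normal(size=DIM)  # M-FOS meta-policy parameters

def gather_trajectories(phi_i, phi_neg_i):
    """Placeholder rollout: returns a fake return for each agent."""
    return rng.normal(), rng.normal()

def opponent_update(phi_neg_i, phi_i):
    """Placeholder inner update for the opponent (e.g., a naive learner step)."""
    return phi_neg_i + 0.1 * rng.normal(size=phi_neg_i.shape)

def meta_policy(theta, phi_i, last_return):
    """Placeholder meta-policy: outputs the M-FOS agent's next parameters."""
    return np.tanh(theta + last_return)

for _ in range(META_ITERS):
    # Step 3: initialize both agents' parameters at the start of each meta-episode
    phi_i = rng.normal(size=DIM)
    phi_neg_i = rng.normal(size=DIM)
    meta_return = 0.0
    for t in range(T):
        # Steps 5-6: reset and gather trajectories under the current parameters
        r_i, r_neg_i = gather_trajectories(phi_i, phi_neg_i)
        meta_return += r_i
        # Step 7: opponent updates with its own learning algorithm
        phi_neg_i = opponent_update(phi_neg_i, phi_i)
        # Step 8: M-FOS agent's parameters are set by the meta-policy
        phi_i = meta_policy(theta, phi_i, r_i)
    # Step 10: outer update of theta (placeholder perturbation scaled by the
    # meta-return; the paper uses PPO for this step)
    theta += 0.01 * meta_return * rng.normal(size=DIM)
```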
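The matrix games referenced in Tables 1-3 can be encoded directly as small payoff tensors. The sketch below uses one common parameterization (the exact payoff values in the paper's tables may differ, and the Chicken payoffs in particular are an illustrative assumption); rows index player 1's action, columns player 2's.

```python
# Illustrative payoff tensors for the three matrix games; entry [a1, a2] holds
# (reward to player 1, reward to player 2).
import numpy as np

# Prisoner's Dilemma: actions are (Cooperate, Defect)
PD = np.array([[(-1, -1), (-3,  0)],
               [( 0, -3), (-2, -2)]])

# Matching Pennies: zero-sum, actions are (Heads, Tails)
MP = np.array([[( 1, -1), (-1,  1)],
               [(-1,  1), ( 1, -1)]])

# Chicken Game: actions are (Swerve, Straight); the large penalty for mutual
# "Straight" is a hypothetical choice for illustration
CHICKEN = np.array([[(   0,    0), (  -1,    1)],
                    [(   1,   -1), (-100, -100)]])

def step(payoffs, a1, a2):
    """Return the per-player rewards for one round of the matrix game."""
    r1, r2 = payoffs[a1, a2]
    return r1, r2

print(step(PD, 0, 1))  # player 1 cooperates, player 2 defects -> (-3, 0)
```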
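The Appendix C values quoted in the Experiment Setup row can be collected into a single configuration, as in the sketch below. The observation and action sizes are hypothetical placeholders, since they depend on the specific game and are not quoted here.

```python
# Sketch collecting the quoted PPO hyperparameters into one config, plus a
# matching actor network with one hidden layer of size 256.
import torch
import torch.nn as nn

ppo_config = {
    "adam_step_size": 2e-4,    # "Adam Step Size 0.0002"
    "num_epochs": 4,           # "Number of Epochs 4"
    "clip_eps": 0.2,           # "PPO Clipping ϵ 0.2"
    "entropy_coef": 0.01,      # "Entropy Coefficient 0.01"
    "actor_hidden_layers": 1,  # "Number of Actor Hidden Layers 1"
    "actor_hidden_size": 256,  # "Size of Actor Hidden Layers [256]"
}

obs_dim, act_dim = 10, 2  # hypothetical sizes for illustration

actor = nn.Sequential(
    nn.Linear(obs_dim, ppo_config["actor_hidden_size"]),
    nn.Tanh(),
    nn.Linear(ppo_config["actor_hidden_size"], act_dim),
)
optimizer = torch.optim.Adam(actor.parameters(), lr=ppo_config["adam_step_size"])
```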