Stable Opponent Shaping in Differentiable Games

Authors: Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, Shimon Whiteson

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of SOS in three differentiable games. We first showcase opponent shaping and superiority over LA/CO/SGA/NL in the Iterated Prisoner's Dilemma (IPD). This leaves SOS and LOLA, which have differed only in theory up to now. We bridge this gap by showing that SOS always outperforms LOLA in the tandem game, avoiding arrogant behaviour by decaying p while LOLA overshoots. Finally we implement a more involved GAN setup, testing for mode collapse and mode hopping when learning Gaussian mixture distributions.
Researcher Affiliation | Collaboration | Alistair Letcher¹, Jakob Foerster¹, David Balduzzi², Tim Rocktäschel³, Shimon Whiteson¹ (¹University of Oxford, ²DeepMind, ³University College London)
Pseudocode | Yes | Algorithm 1 (a NumPy sketch of the update step appears below the table):
Algorithm 1: Stable Opponent Shaping
1: Initialise θ randomly and fix hyperparameters a, b ∈ (0, 1).
2: while not done do
3:   Compute ξ₀ = (I − αH_o)ξ and χ = diag(H_oᵀ ∇L) at θ.
4:   if ⟨−αχ, ξ₀⟩ > 0 then p₁ = 1 else p₁ = min{1, a‖ξ₀‖² / ⟨αχ, ξ₀⟩}
5:   if ‖ξ‖ < b then p₂ = ‖ξ‖² else p₂ = 1
6:   Let p = min{p₁, p₂}, compute ξₚ = ξ₀ − pαχ and assign θ ← θ − αξₚ.
Open Source Code | No | The paper does not include an explicit statement or link indicating that its source code is publicly available.
Open Datasets | No | The paper describes the experimental setups (IPD, tandem game, Gaussian mixtures) and how data is sampled for the latter, but it does not provide concrete access information (e.g., links, DOIs, repository names, or formal citations with authors and years) for these datasets to confirm their public availability.
Dataset Splits | No | The paper mentions running "300 training episodes" for different games but does not specify any training, validation, or test dataset splits (e.g., percentages or exact counts) to reproduce the partitioning.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud computing resources) used to conduct the experiments.
Software Dependencies | No | The paper mentions that the stop-gradient operator is implemented in "TensorFlow" and "PyTorch". However, it does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. (A toy stop-gradient example appears below the table.)
Experiment Setup | Yes | (an initialisation sketch for the IPD settings appears below the table)
IPD: We run 300 training episodes for SOS, LA, CO, SGA and NL. The parameters are initialised following a normal distribution around 1/2 probability of cooperation, with unit variance. We fix α = 1 and γ = 0.96, following Foerster et al. (2018). We choose a = 0.5 and b = 0.1 for SOS.
Tandem: Here we fix a = b = 0.5 and α = 0.1.
Gaussian mixtures: Learning rates are chosen by grid search at iteration 8k, with a = 0.5 and b = 0.1 for SOS, following the same reasoning as the IPD.
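
To make Algorithm 1 concrete, the following is a minimal NumPy sketch of a single SOS update (steps 4-6 of the pseudocode). It is not the authors' released code: the function name sos_step and its calling convention are our own, and the caller is assumed to have already computed the simultaneous gradient ξ, the LookAhead gradient ξ₀ = (I − αH_o)ξ and the shaping term χ, which in practice require Hessian-vector products through the opponents' parameters.

    import numpy as np

    def sos_step(theta, xi, xi_0, chi, alpha, a=0.5, b=0.1):
        """One SOS update (sketch of Algorithm 1, steps 4-6).

        xi   : simultaneous gradient at theta
        xi_0 : LookAhead gradient (I - alpha * H_o) @ xi
        chi  : shaping term diag(H_o^T grad_L)
        """
        # Criterion 1: if the shaping term -alpha*chi aligns with xi_0,
        # keep it in full (p1 = 1); otherwise scale it back so the update
        # stays well aligned with the LookAhead direction xi_0.
        inner = alpha * np.dot(chi, xi_0)      # <alpha*chi, xi_0>
        if inner <= 0.0:                       # i.e. <-alpha*chi, xi_0> >= 0
            p1 = 1.0                           # also avoids division by zero
        else:
            p1 = min(1.0, a * np.dot(xi_0, xi_0) / inner)

        # Criterion 2: decay the shaping weight near fixed points
        # (small ||xi||) so local convergence guarantees are preserved.
        xi_norm = np.linalg.norm(xi)
        p2 = xi_norm ** 2 if xi_norm < b else 1.0

        # Interpolate between LookAhead (p = 0) and LOLA (p = 1),
        # then take a gradient step.
        p = min(p1, p2)
        xi_p = xi_0 - p * alpha * chi
        return theta - alpha * xi_p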
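
On the software-dependency row: no versions are given, but the stop-gradient primitive the paper refers to is standard in both frameworks. Below is a toy PyTorch illustration of the behaviour (the TensorFlow equivalent is tf.stop_gradient); the specific numbers are ours, for illustration only.

    import torch

    # detach() blocks gradient flow through y, so y is treated as a
    # constant during backpropagation -- the stop-gradient behaviour
    # that opponent-shaping implementations rely on.
    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2                  # y = 4.0, depends on x
    z = x * y.detach()          # gradient does not flow through y
    z.backward()
    print(x.grad)               # tensor(4.) -- not d(x * x**2)/dx = 12.0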
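
As a worked example of the quoted IPD configuration, here is a minimal initialisation sketch. The 5-logit tabular policy per player (first move plus the four joint outcomes CC, CD, DC, DD) follows Foerster et al. (2018) and is our assumption; the quoted text itself only fixes the hyperparameters and the initialisation distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hyperparameters quoted for the IPD experiments.
    alpha, gamma = 1.0, 0.96    # learning rate, discount factor
    a, b = 0.5, 0.1             # SOS hyperparameters
    episodes = 300              # training episodes per method

    # Assumed parameterisation: 5 logits per player -- cooperation
    # probability on the first move and after each joint outcome
    # (CC, CD, DC, DD). Zero-mean, unit-variance logits centre the
    # initial cooperation probability on 1/2, matching the paper.
    theta = rng.normal(loc=0.0, scale=1.0, size=(2, 5))
    coop_prob = 1.0 / (1.0 + np.exp(-theta))    # sigmoid(logits)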