Adversarial Diversity in Hanabi

Authors: Brandon Cui, Andrei Lupu, Samuel Sokota, Hengyuan Hu, David J Wu, Jakob Nicolaus Foerster

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement and test our method in Hanabi, a large scale cooperative card game proposed as a challenging benchmark for Dec-POMDP research (Bard et al., 2020). We first see in Table 1 that both the SPWR and ADVERSITY agents achieve high SP, corresponding to high skill in Hanabi, and very low XP scores when paired with their respective repulser, showing that both methods produce policies that are incompatible with their repulser. However, ADVERSITY shows a clear advantage over the SPWR agent in terms of intra-AXP scores, computed between independent adversary seeds. (See the evaluation sketch after the table.)
Researcher Affiliation | Collaboration | Brandon Cui (MosaicML); Andrei Lupu (Meta AI & FLAIR, University of Oxford); Samuel Sokota (Carnegie Mellon University); Hengyuan Hu (Stanford University); David J. Wu (Meta AI); Jakob N. Foerster (FLAIR, University of Oxford)
Pseudocode | Yes | Algorithm 1: ADVERSITY training at level ℓ for one data collection and training step. We present the two-player case. At timestep t, the active player is i and the next player is −i.
Open Source Code | Yes | "...and open-source our agents to be used for future research on (ad-hoc) coordination" (footnote: https://github.com/facebookresearch/off-belief-learning). Our implementation is based on the open-sourced OBL code with two main modifications.
Open Datasets | Yes | We implement and test our method in Hanabi, a large scale cooperative card game proposed as a challenging benchmark for Dec-POMDP research (Bard et al., 2020).
Dataset Splits | No | The paper describes training within the Hanabi environment and evaluation by pairing agents, but it does not specify explicit train/validation/test dataset splits with percentages, sample counts, or predefined partition methods.
Hardware Specification | No | The paper mentions "All models are on GPUs" and "We run 80 threads in parallel", but it does not provide specific GPU/CPU models, processor types, or detailed machine specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components such as PPO and the Adam optimizer, but it does not provide specific version numbers for these or other software dependencies used in the implementation (e.g., "PyTorch 1.x" or "Python 3.x").
Experiment Setup | Yes | For each adversary, we train a hierarchy of 7 levels, setting λ = 0.25 for ℓ = 1 and decreasing it by 0.08 every level (min. 0). All bots are trained for 3000 epochs; each epoch consists of 1000 gradient steps. We run 6400 games in parallel, each adding to a centralized replay buffer whose size is set to a small value of 1024. (See the hyperparameter sketch after the table.)
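
Evaluation sketch referenced in the Research Type row. This is a minimal illustration of how the three reported quantities could be tabulated, not the authors' evaluation code: it assumes a hypothetical helper play_hanabi(agent_a, agent_b) that returns the mean Hanabi score of a pairing.

    from itertools import combinations
    from statistics import mean

    def sp_score(agent, play_hanabi):
        # SP: the agent paired with a copy of itself (high SP = high skill in Hanabi).
        return play_hanabi(agent, agent)

    def xp_score(adversary, repulser, play_hanabi):
        # XP: the adversary paired with the repulser it was trained against;
        # very low XP indicates the two policies are incompatible.
        return play_hanabi(adversary, repulser)

    def intra_axp_score(adversary_seeds, play_hanabi):
        # Intra-AXP: mean cross-play score over all pairs of independently
        # seeded adversaries trained against the same repulser.
        return mean(play_hanabi(a, b) for a, b in combinations(adversary_seeds, 2))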
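
Hyperparameter sketch referenced in the Experiment Setup row. The constants restate the numbers quoted there; the names (lambda_for_level, TRAIN_CONFIG) are illustrative and not taken from the released code.

    NUM_LEVELS = 7           # hierarchy depth trained per adversary
    LAMBDA_START = 0.25      # lambda at level 1
    LAMBDA_STEP = 0.08       # decrease per level, floored at 0

    def lambda_for_level(level: int) -> float:
        # Repulsion weight lambda at a given (1-indexed) level.
        return max(0.0, LAMBDA_START - LAMBDA_STEP * (level - 1))

    TRAIN_CONFIG = {
        "epochs": 3000,               # per bot
        "grad_steps_per_epoch": 1000,
        "parallel_games": 6400,       # all feeding one centralized replay buffer
        "replay_buffer_size": 1024,   # deliberately small
    }

    # Lambda schedule over the 7 levels: 0.25, 0.17, 0.09, 0.01, 0.0, 0.0, 0.0
    print([round(lambda_for_level(l), 2) for l in range(1, NUM_LEVELS + 1)])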