Trajectory Diversity for Zero-Shot Coordination

Authors: Andrei Lupu, Brandon Cui, Hengyuan Hu, Jakob Foerster

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TrajeDi experimentally. Using two MDPs and a matrix game, we provide empirical insights into the shortcomings of standard approaches and show the suitability of TrajeDi in discovering multiple optimal solutions. We then demonstrate that TrajeDi scales well to arbitrarily complex settings by using it to improve ZSC scores in the collaborative, partially observable card game Hanabi.
Researcher Affiliation | Collaboration | Mila, McGill University (work done while at Facebook AI Research) and Facebook AI Research.
Pseudocode | Yes | We present the full TrajeDi PBT procedure for ZSC in Algorithm 1 (TrajeDi PBT with Common Best Response). A hedged sketch of such a loop is given after this table.
Open Source Code | Yes | The code is available online and can be run in-browser: https://bit.ly/33NBw5o
Open Datasets | Yes | Finally, we apply Algorithm 1 to improve ZSC in Hanabi. We chose this game because it was recently proposed as a challenge in artificial intelligence (Bard et al., 2020), and because it was studied by Hu et al. in the context of ZSC. An environment-loading sketch follows this table.
Dataset Splits | No | The paper discusses training and test (cross-play) performance but does not describe a separate validation split. A cross-play evaluation sketch is included after this table.
Hardware Specification | No | The paper notes that training is "very compute intensive" and uses 2 GPUs per agent, but it does not specify the GPU model or other hardware components (e.g., CPU, RAM).
Software Dependencies | No | The paper does not provide version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We implement simple policy-gradient policies and train 10 populations of n agents... we use γ = 1. ...we put a high weight on the TrajeDi loss term (α = 4 in eq. 7)... we train four independent pools of TrajeDi-regularized policies of size 3... all our policies are trained with OP augmented with an auxiliary task... we prevent agents from seeing the last action used by the partner. (These values are collected into a configuration sketch at the end of the page.)
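
For reference, below is a minimal, hedged sketch of the kind of loop Algorithm 1 describes: a population trained with a weighted diversity term alongside a common best response trained against the whole population. It uses a toy one-step coordination game (where the action distribution coincides with the trajectory distribution), PyTorch autograd instead of the paper's policy-gradient estimators, and illustrative names (PAYOFF, jsd, br_logits). It is not the authors' implementation.

```python
# Minimal, illustrative sketch of a TrajeDi-style population objective with a
# common best response, on a toy one-step coordination game. Not the authors'
# implementation: names, optimizer, and the exact-gradient shortcut are ours.
import torch

torch.manual_seed(0)

PAYOFF = torch.eye(5)                 # 5 equally good "conventions" on the diagonal
N_AGENTS, ALPHA, LR, STEPS = 3, 4.0, 0.1, 3000

pop_logits = torch.nn.Parameter(0.1 * torch.randn(N_AGENTS, 5))  # population policies
br_logits = torch.nn.Parameter(0.1 * torch.randn(5))             # common best response
opt = torch.optim.Adam([pop_logits, br_logits], lr=LR)

def jsd(dists):
    """Jensen-Shannon divergence across the population's action distributions.
    In a one-step game the action distribution is the trajectory distribution,
    so this stands in for the paper's trajectory-level diversity objective."""
    ent = lambda p: -(p * (p + 1e-12).log()).sum(dim=-1)
    return ent(dists.mean(dim=0)) - ent(dists).mean()

for _ in range(STEPS):
    pop = torch.softmax(pop_logits, dim=-1)       # (N_AGENTS, n_actions)
    br = torch.softmax(br_logits, dim=-1)         # (n_actions,)

    self_play = torch.einsum('na,ab,nb->n', pop, PAYOFF, pop).mean()
    br_return = torch.einsum('na,ab,b->n', pop, PAYOFF, br).mean()

    # Maximise each member's return and the common best response's return,
    # plus a weighted diversity bonus over the population.
    loss = -(self_play + br_return + ALPHA * jsd(pop))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Population members should commit to distinct optima; the common best response
# is the agent one would deploy for zero-shot coordination.
print(torch.softmax(pop_logits, dim=-1).detach())
print(torch.softmax(br_logits, dim=-1).detach())
```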
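
The Hanabi benchmark of Bard et al. (2020) is distributed as the Hanabi Learning Environment. The snippet below only shows, as an assumption about the standard rl_env API, how that environment can be instantiated and stepped with random legal moves; the paper's actual experiments build on Hu et al.'s ZSC training stack, and the exact wrapper the authors use may differ.

```python
# Sketch only: instantiating the Hanabi Learning Environment (Bard et al., 2020)
# and playing random legal moves. The paper's experiments use a full training
# stack on top of this benchmark, not this snippet.
# pip install hanabi-learning-environment
import random
from hanabi_learning_environment import rl_env

env = rl_env.make("Hanabi-Full", num_players=2)
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    me = obs["current_player"]
    legal_moves = obs["player_observations"][me]["legal_moves"]
    action = random.choice(legal_moves)            # uniform over legal moves
    obs, reward, done, _ = env.step(action)
    episode_return += reward
print("random-play return:", episode_return)
```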
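
Zero-shot coordination is scored by cross-play: agents from independently trained runs are paired at test time and their returns averaged, while the diagonal of the same matrix gives self-play scores. A small sketch of that evaluation is given below; evaluate(agent_a, agent_b) is a hypothetical stand-in for rolling out paired episodes in the environment and is not part of the paper's codebase.

```python
# Hedged sketch of cross-play evaluation for zero-shot coordination.
# `evaluate(agent_a, agent_b)` is a hypothetical stand-in for rolling out and
# averaging paired episodes; it is not part of the paper's codebase.
from itertools import product
import numpy as np

def cross_play_matrix(agents, evaluate):
    """Score every ordered pairing of independently trained agents."""
    n = len(agents)
    scores = np.zeros((n, n))
    for i, j in product(range(n), repeat=2):
        scores[i, j] = evaluate(agents[i], agents[j])
    return scores

def zsc_summary(scores):
    """Diagonal entries are self-play; off-diagonal entries are zero-shot cross-play."""
    n = scores.shape[0]
    off_diag = (scores.sum() - np.trace(scores)) / (n * (n - 1))
    return {"self_play": float(np.diag(scores).mean()), "cross_play": float(off_diag)}

# Toy usage: "agents" are preferred conventions; a pairing scores 1 if they match.
toy_agents = [0, 0, 1, 2]
print(zsc_summary(cross_play_matrix(toy_agents, lambda a, b: float(a == b))))
```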
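
Finally, the quoted experiment setup can be collected into a small configuration sketch. The field names are ours, the values are only those stated in the Experiment Setup row above, and anything the excerpt does not specify is left as None.

```python
# Hedged summary of the quoted experiment setup as a configuration sketch.
# Field names are ours; anything the excerpt does not state is left as None.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MatrixGameAndMDPSetup:
    n_populations: int = 10                # "train 10 populations of n agents"
    population_size: Optional[int] = None  # n is not fixed in the excerpt
    gamma: float = 1.0                     # undiscounted returns (gamma = 1)
    trajedi_alpha: float = 4.0             # high weight on the TrajeDi loss (eq. 7)

@dataclass
class HanabiSetup:
    n_pools: int = 4                       # four independent TrajeDi-regularized pools
    pool_size: int = 3
    other_play: bool = True                # OP augmented with an auxiliary task
    auxiliary_task: bool = True
    hide_partner_last_action: bool = True  # partner's last action is not observed
    gpus_per_agent: int = 2                # from the hardware row above
```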