Trajectory Diversity for Zero-Shot Coordination
Authors: Andrei Lupu, Brandon Cui, Hengyuan Hu, Jakob Foerster
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TrajeDi experimentally. Thanks to two MDPs and a matrix game, we provide empirical insights into the shortcomings of standard approaches and show the suitability of TrajeDi in discovering multiple optimal solutions. Afterwards, we proceed to demonstrate that TrajeDi scales well to arbitrarily complex settings by using it to improve ZSC scores in the collaborative partially observable card game Hanabi. |
| Researcher Affiliation | Collaboration | 1Mila, Mc Gill University (Work done while at Facebook AI Research) 2Facebook AI Research. |
| Pseudocode | Yes | We present the full TrajeDi PBT procedure for ZSC in algorithm 1. Algorithm 1: TrajeDi PBT with Common Best Response (hedged sketches of the population update and the common best response follow this table). |
| Open Source Code | Yes | The code is available online and can be run in-browser: https://bit.ly/33NBw5o |
| Open Datasets | Yes | Finally, we apply algorithm 1 to improve ZSC in Hanabi. We chose this game because it was recently proposed as a challenge in artificial intelligence (Bard et al., 2020), and because it was studied by Hu et al. in the context of ZSC. |
| Dataset Splits | No | The paper discusses training and test (cross-play) performance but does not explicitly mention or detail a separate validation dataset split. |
| Hardware Specification | No | The paper mentions that training is 'very compute intensive' and 'it uses 2 GPUs per agent', but it does not specify the model or type of GPUs or any other hardware components (e.g., CPU, RAM). |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for any software dependencies or libraries used in their experiments. |
| Experiment Setup | Yes | We implement simple policy-gradient policies and train 10 populations of n agents... we use γ = 1. ...we put a high weight on the TrajeDi loss term (α = 4 in eq. 7)... we train four independent pools of TrajeDi-regularized policies of size 3... all our policies are trained with OP augmented with an auxiliary task... we prevent agents from seeing the last action used by the partner. (See the loss-weighting sketch after this table.) |
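To make the quoted setup more concrete, below is a minimal sketch of the kind of objective the "Experiment Setup" row describes: a small population of softmax policies trained on a one-shot lever-coordination matrix game, with a Jensen-Shannon-style diversity term weighted by α (the paper uses α = 4 in eq. 7). This is an illustrative assumption, not the authors' code: the lever game, `POP_SIZE`, `N_ACTIONS`, and `jsd` are all placeholders, and in the one-shot case the paper's discounted trajectory diversity (with γ = 1) reduces to a plain divergence over action distributions. In the paper itself the diversity is computed over full trajectories and the policies are neural networks trained in Hanabi; the sketch only mirrors the loss structure.

```python
# Illustrative sketch only -- not the authors' implementation.
# A population of softmax policies plays a one-shot "lever" coordination game
# (payoff 1 if both copies of a policy pick the same lever) and is trained to
# maximise expected return plus an alpha-weighted Jensen-Shannon diversity term.
import torch

POP_SIZE, N_ACTIONS, ALPHA, LR, STEPS = 3, 5, 4.0, 0.1, 2000  # alpha = 4 as in eq. 7

logits = torch.randn(POP_SIZE, N_ACTIONS, requires_grad=True)  # random init breaks symmetry
opt = torch.optim.Adam([logits], lr=LR)

def jsd(p):
    """Jensen-Shannon-style divergence of a batch of categorical distributions."""
    m = p.mean(dim=0, keepdim=True)
    kl = (p * (p.clamp_min(1e-8).log() - m.clamp_min(1e-8).log())).sum(dim=1)
    return kl.mean()

for _ in range(STEPS):
    probs = torch.softmax(logits, dim=1)
    # Expected self-play return of each member: P(both copies choose the same lever).
    returns = (probs * probs).sum(dim=1)
    # Maximise mean return plus the diversity bonus (gradient descent on the negation).
    loss = -(returns.mean() + ALPHA * jsd(probs))
    opt.zero_grad()
    loss.backward()
    opt.step()

# With a large enough diversity weight, members tend to settle on different levers.
print(torch.softmax(logits, dim=1).argmax(dim=1))
```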
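Algorithm 1 additionally maintains a single "common best response" trained against the whole population. Under the same toy assumptions (a frozen, hand-written population of lever policies; all names hypothetical), a sketch of that step is simply a policy that maximises its expected matching payoff against a uniformly sampled population member:

```python
# Illustrative sketch only -- not the authors' implementation.
# Given a frozen population of lever policies, train one "common best response"
# that maximises its expected matching payoff against a uniformly sampled member.
import torch

population = torch.tensor([            # hand-written, near-deterministic lever policies
    [0.80, 0.10, 0.05, 0.025, 0.025],
    [0.10, 0.80, 0.05, 0.025, 0.025],
    [0.30, 0.10, 0.50, 0.050, 0.050],
])

br_logits = torch.zeros(population.shape[1], requires_grad=True)
opt = torch.optim.Adam([br_logits], lr=0.1)

for _ in range(500):
    br = torch.softmax(br_logits, dim=0)
    # Expected payoff: probability the best response matches a random member's lever.
    payoff = (population * br).sum(dim=1).mean()
    opt.zero_grad()
    (-payoff).backward()
    opt.step()

# The best response concentrates on the lever the population agrees on most often.
print(torch.softmax(br_logits, dim=0).argmax())
```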