K-level Reasoning for Zero-Shot Coordination in Hanabi
Authors: Brandon Cui, Hengyuan Hu, Luis Pineda, Jakob Foerster
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method and baseline in self-play (SP), zero-shot coordination (ZSC), ad-hoc teamplay, and human-AI settings. For zero-shot coordination, we follow the problem definition from [18] and evaluate models through cross-play (XP), where we repeat training 5 times with different seeds and pair the independently trained agents with each other. (A hedged sketch of this cross-play pairing appears after the table.) |
| Researcher Affiliation | Collaboration | Brandon Cui, Facebook AI Research (bcui@fb.com); Hengyuan Hu, Facebook AI Research (hengyuan@fb.com); Luis Pineda, Facebook AI Research (lep@fb.com); Jakob N. Foerster, University of Oxford (jakob.foerster@eng.ox.ac.uk) |
| Pseudocode | Yes | Algorithm 1: Client-Server Implementation of k-level reasoning, cognitive hierarchies, SyKLRBR. |
| Open Source Code | No | We will release an open source version of our code and copies of our trained agents later. |
| Open Datasets | No | We used a dataset of 208,974 games obtained from Board Game Arena (https://en.boardgamearena.com/). While the data source (Board Game Arena) is public, the paper does not provide a direct link or citation to the specific dataset of 208,974 games used for training. |
| Dataset Splits | No | The paper mentions duplicating games for training examples and evaluating models through cross-play (XP) by repeating training 5 times with different seeds, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions running efficiently on GPUs, but it does not specify exact GPU models, CPU models, or other specific hardware configurations. It refers to Appendix A for these details, but Appendix A is not provided in the given text. |
| Software Dependencies | No | The paper mentions using Recurrent Replay Distributed Deep Q-Network (R2D2) and Independent Q-learning (IQL), but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper describes the general training schema, including synchronous training for 24 hours and sequential training that halts after 24 hours per level. It mentions using distributed deep recurrent Q-networks with prioritized experience replay and a centralized replay buffer. However, it defers 'complete training details' to Appendix A, and the provided text does not contain specific hyperparameter values (e.g., learning rate, batch size, optimizer details). (A hedged sketch of the sequential per-level schedule appears after the table.) |
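
The Research Type row quotes the paper's cross-play (XP) protocol: train 5 independent seeds and pair the independently trained agents with each other. The sketch below illustrates only that pairing logic; `train_agent` and `play_hanabi` are hypothetical placeholders standing in for the authors' training and evaluation code, not their released implementation.

```python
import itertools
import statistics

def cross_play_evaluation(train_agent, play_hanabi, n_seeds=5, n_games=1000):
    """Sketch of cross-play (XP) evaluation as described in the quoted text:
    train one agent per seed, then pair every agent with every *other*
    independently trained agent and average the resulting Hanabi scores.

    `train_agent(seed)` and `play_hanabi(agent_a, agent_b, n_games)` are
    hypothetical helpers, not part of the paper's codebase.
    """
    agents = [train_agent(seed) for seed in range(n_seeds)]

    # Cross-play: ordered pairs of distinct seeds (i != j).
    xp_scores = [play_hanabi(agents[i], agents[j], n_games)
                 for i, j in itertools.permutations(range(n_seeds), 2)]

    # Self-play (SP): the same agent on both seats, reported for comparison.
    sp_scores = [play_hanabi(agents[i], agents[i], n_games)
                 for i in range(n_seeds)]

    return statistics.mean(xp_scores), statistics.mean(sp_scores)
```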
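
The Pseudocode and Experiment Setup rows refer to Algorithm 1 (client-server k-level reasoning, cognitive hierarchies, SyKLRBR) and to a 24-hour training budget per level. A minimal sketch of the sequential per-level schedule is given below, assuming hypothetical `random_policy` and `train_best_response` helpers; the paper's synchronous client-server variant (SyKLRBR) trains levels in parallel, and the actual hyperparameters are deferred to its Appendix A.

```python
import time

def sequential_k_level_training(train_best_response, random_policy,
                                max_level=3, hours_per_level=24):
    """Sketch of sequential k-level reasoning: level 0 acts randomly, and
    each level k > 0 is trained as a reinforcement-learning best response
    to the frozen level k-1 policy, with a fixed wall-clock budget per level.

    `train_best_response(partner, deadline)` and `random_policy` are
    hypothetical stand-ins, not the authors' implementation.
    """
    policies = [random_policy]  # level 0: uniformly random play
    for level in range(1, max_level + 1):
        deadline = time.time() + hours_per_level * 3600
        policies.append(train_best_response(partner=policies[level - 1],
                                            deadline=deadline))
    return policies
```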