Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
K-level Reasoning for Zero-Shot Coordination in Hanabi
Authors: Brandon Cui, Hengyuan Hu, Luis Pineda, Jakob Foerster
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method and baseline in both self-play (SP), zero-shot coordination (ZSC), ad-hoc teamplay and human-AI settings. For zero-shot coordination, we follow the problem definition from [18] and evaluate models through cross-play (XP) where we repeat training 5 times with different seeds and pair the independently trained agents with each other. |
| Researcher Affiliation | Collaboration | Brandon Cui Facebook AI Research EMAIL Hengyuan Hu Facebook AI Research EMAIL Luis Pineda Facebook AI Research EMAIL Jakob N. Foerster University of Oxford EMAIL |
| Pseudocode | Yes | Algorithm 1: Client-Server Implementation of k-level reasoning, cognitive hierarchies, SyKLRBR. |
| Open Source Code | No | We will release an open source version of our code and copies of our trained agents later. |
| Open Datasets | No | We used a dataset of 208,974 games obtained from Board Game Arena (https://en.boardgamearena.com/). While the data source (Board Game Arena) is public, the paper does not provide a direct link or citation to the specific dataset of 208,974 games used for training. |
| Dataset Splits | No | The paper mentions duplicating games for training examples and evaluating models through cross-play (XP) by repeating training 5 times with different seeds, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions running efficiently on GPUs, but it does not specify exact GPU models, CPU models, or other specific hardware configurations. It refers to Appendix A for these details, but Appendix A is not provided in the given text. |
| Software Dependencies | No | The paper mentions using Recurrent Replay Distributed Deep Q-Network (R2D2) and Independent Q-learning (IQL), but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper describes the general training scheme, including synchronous training for 24 hours and sequential training halting after 24 hours per level. It mentions using distributed deep recurrent Q-Networks with prioritized experience replay and a centralized replay buffer. However, it defers 'complete training details' to Appendix A, and the provided text does not contain specific hyperparameter values (e.g., learning rate, batch size, optimizer details). |
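
The cross-play (XP) protocol quoted in the Research Type row pairs independently trained agents with each other and averages the resulting scores. The following is a minimal sketch of that bookkeeping only, not the authors' evaluation code. The names `cross_play_score`, `self_play_score`, and `toy_play_game` are hypothetical, and the toy payoff values are invented purely for illustration; a real evaluation would replace `toy_play_game` with a function that plays Hanabi games between two trained agents.

```python
from itertools import permutations
from statistics import mean
from typing import Callable, Sequence, TypeVar

Agent = TypeVar("Agent")


def cross_play_score(
    agents: Sequence[Agent],
    play_game: Callable[[Agent, Agent], float],
) -> float:
    """Average score over all ordered pairs of distinct agents."""
    if len(agents) < 2:
        raise ValueError("cross-play needs at least two independently trained agents")
    return mean(play_game(a, b) for a, b in permutations(agents, 2))


def self_play_score(
    agents: Sequence[Agent],
    play_game: Callable[[Agent, Agent], float],
) -> float:
    """Average score when each agent is paired with a copy of itself."""
    return mean(play_game(a, a) for a in agents)


def toy_play_game(a: int, b: int) -> float:
    """Hypothetical stand-in for a game: agents are identified by their seed.

    Assumption for illustration only: an agent coordinates well with itself
    and poorly with agents trained from other seeds.
    """
    return 24.0 if a == b else 3.0


if __name__ == "__main__":
    seeds = [0, 1, 2, 3, 4]  # five training runs, as in the quoted protocol
    print("self-play :", self_play_score(seeds, toy_play_game))
    print("cross-play:", cross_play_score(seeds, toy_play_game))
```

A large gap between the self-play and cross-play averages in such a computation indicates agents that rely on seed-specific conventions, which is the failure mode zero-shot coordination evaluations are designed to expose.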