Off-Team Learning
Authors: Brandon Cui, Hengyuan Hu, Andrei Lupu, Samuel Sokota, Jakob Foerster
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate these methods in variants of Hanabi. We evaluate OT-BL empirically and show that it ameliorates the covariate shift experienced by belief models when training OBL and outperforms vanilla OBL in a 2-life variant of the card game Hanabi. |
| Researcher Affiliation | Collaboration | Brandon Cui (MosaicML, brandon@mosaicml.com); Hengyuan Hu (Stanford University, hengyuan@cs.stanford.edu); Andrei Lupu (Meta AI & FLAIR, University of Oxford, alupu@meta.com); Samuel Sokota (Carnegie Mellon University, ssokota@andrew.cmu.edu); Jakob N. Foerster (FLAIR, University of Oxford, jakob.foerster@eng.ox.ac.uk) |
| Pseudocode | Yes | We give pseudocode in Algorithms 1 and 2, and visualizations are provided in Figure 1 and Figure 2. |
| Open Source Code | Yes | We implement our algorithms based on the open-sourced code for OBL (https://github.com/facebookresearch/off-belief-learning). Additionally, the paper's checklist includes: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]' |
| Open Datasets | Yes | Hanabi [1] is a benchmark for Dec-POMDP research. [1] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020. |
| Dataset Splits | No | The paper operates within a reinforcement learning framework, where evaluation is primarily done through simulation and cross-play scores (a minimal sketch of cross-play scoring is given after this table). It does not provide explicit training, validation, and test dataset splits as typically found in supervised learning. |
| Hardware Specification | No | The paper mentions 'a distributed recurrent Q-learning method with parallel environment workers and centralized replay buffer and trainer' but does not specify any particular hardware components like CPU/GPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper refers to 'R2D2 [8]' as the backbone and states that the algorithms are implemented based on the open-sourced code for OBL. However, it does not list specific version numbers for its software dependencies in the main body of the paper. |
| Experiment Setup | Yes | We implement our algorithms based on the open-sourced code for OBL and extend it with ideas from synchronous training [10, 5], training all models simultaneously, thus enabling effective implementations of OT-BL and OT-OBL. When synchronously training OBL models, every n = 50 steps we save a copy of the model and then query for and update all dependencies. We follow the practices in the original OBL paper [7] to train each policy and belief model. The backbone is R2D2 [8], a distributed recurrent Q-learning method with parallel environment workers and a centralized replay buffer and trainer. We briefly describe the loss functions for the Q-network and belief models here; for more details, please refer to Appendix A.1 or the OBL paper [7]. (A minimal sketch of this synchronous update schedule follows the table.) |
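
The "Experiment Setup" row quotes the paper's synchronous training scheme: all policy and belief models are trained simultaneously, and every n = 50 steps a copy of each model is saved so that all dependencies can be queried and updated. The sketch below illustrates that checkpoint-exchange schedule under simplifying assumptions (a single process, no distributed replay buffer or environment workers); `TrainableModel`, `train_step`, and `next_batch` are hypothetical names, not identifiers from the released off-belief-learning code.

```python
# Minimal sketch of the synchronous training schedule described above:
# all policy/belief models are trained together, and every 50 steps each
# model publishes a frozen copy that its dependents (e.g. the belief model
# of the next OBL level) are trained against. All names are placeholders.

import copy
from typing import Callable, Dict, List, Protocol


class TrainableModel(Protocol):
    def train_step(self, batch: object, frozen_deps: Dict[str, "TrainableModel"]) -> None: ...


SYNC_INTERVAL = 50  # "every n = 50 steps we save a copy of the model"


def train_synchronously(
    models: Dict[str, TrainableModel],
    deps: Dict[str, List[str]],           # model name -> names of models it consumes
    next_batch: Callable[[str], object],  # model name -> a training batch
    num_steps: int,
) -> None:
    # Frozen snapshots that dependents read; refreshed every SYNC_INTERVAL steps.
    frozen = {name: copy.deepcopy(m) for name, m in models.items()}

    for step in range(1, num_steps + 1):
        for name, model in models.items():
            # Each model is updated against the latest frozen copies of its
            # dependencies (the policy <-> belief coupling in OBL / OT-BL).
            model.train_step(next_batch(name), {d: frozen[d] for d in deps.get(name, [])})

        if step % SYNC_INTERVAL == 0:
            # Publish fresh snapshots so that all dependencies are updated.
            frozen = {name: copy.deepcopy(m) for name, m in models.items()}
```

In the distributed R2D2-style setup described in the paper, the saved copies would presumably be shared with the parallel environment workers that generate replay data rather than held in a single process as above.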
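
The "Dataset Splits" row points to simulation and cross-play scores as the evaluation protocol. As a rough illustration of why no train/validation/test split exists, the sketch below pairs independently trained policies and averages their game scores; `make_env` and `run_episode` are hypothetical helpers rather than functions from the paper's codebase.

```python
# Hedged sketch of cross-play evaluation: independently trained policies are
# paired with one another and scored by rolling out full games, instead of
# being evaluated on held-out data splits.

import itertools
import statistics
from typing import Callable, Dict, List, Tuple


def cross_play_scores(
    policies: List[object],
    make_env: Callable[[int], object],                       # seed -> environment
    run_episode: Callable[[object, object, object], float],  # env, pi_a, pi_b -> game score
    num_games: int = 1000,
) -> Dict[Tuple[int, int], float]:
    """Mean game score for every ordered pair of policies."""
    scores: Dict[Tuple[int, int], float] = {}
    for i, j in itertools.product(range(len(policies)), repeat=2):
        results = [
            run_episode(make_env(seed), policies[i], policies[j])
            for seed in range(num_games)
        ]
        scores[(i, j)] = statistics.mean(results)
    return scores
```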