Off-Team Learning

Authors: Brandon Cui, Hengyuan Hu, Andrei Lupu, Samuel Sokota, Jakob Foerster

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate these methods in variants of Hanabi. We evaluate OT-BL empirically and show that it ameliorates the covariate shift experienced by belief models when training OBL and outperforms vanilla OBL in a 2-life variant of the card game Hanabi.
Researcher Affiliation | Collaboration | Brandon Cui (MosaicML, brandon@mosaicml.com); Hengyuan Hu (Stanford University, hengyuan@cs.stanford.edu); Andrei Lupu (Meta AI & FLAIR, University of Oxford, alupu@meta.com); Samuel Sokota (Carnegie Mellon University, ssokota@andrew.cmu.edu); Jakob N. Foerster (FLAIR, University of Oxford, jakob.foerster@eng.ox.ac.uk)
Pseudocode | Yes | We give pseudocode in Algorithms 1 and 2, and visualizations are provided in Figure 1 and Figure 2.
Open Source Code | Yes | We implement our algorithms based on the open-sourced code for OBL (https://github.com/facebookresearch/off-belief-learning). Additionally, the paper's checklist includes: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]'
Open Datasets | Yes | Hanabi [1] is a benchmark for Dec-POMDP research (an illustrative environment sketch follows the table). [1] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.
Dataset Splits | No | The paper operates within a reinforcement learning framework, where evaluation is primarily done through simulation and cross-play scores. It does not provide explicit training, validation, and test dataset splits as typically found in supervised learning.
Hardware Specification | No | The paper mentions 'a distributed recurrent Q-learning method with parallel environment workers and centralized replay buffer and trainer' but does not specify any particular hardware components, such as CPU/GPU models, memory, or cloud instance types, used for running the experiments.
Software Dependencies | No | The paper refers to R2D2 [8] as the backbone and states that the algorithms are implemented based on the open-sourced code for OBL. However, it does not explicitly list version numbers for its software dependencies within the main body of the paper.
Experiment Setup | Yes | We implement our algorithms based on the open-sourced code for OBL and extend it with ideas from synchronous training [10, 5], training all models simultaneously, thus enabling effective implementations of OT-BL and OT-OBL. When synchronously training OBL models, every n = 50 steps we save a copy of the model, then query for and update all dependencies. We follow practices in the original OBL paper [7] to train each policy and belief model. The backbone is R2D2 [8], a distributed recurrent Q-learning method with parallel environment workers and a centralized replay buffer and trainer. We briefly describe the loss functions for the Q-network and belief models here; for more details, please refer to Appendix A.1 or the OBL paper [7]. (A sketch of the described synchronous checkpoint cadence follows the table.)
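
To make the Open Datasets row concrete, below is a minimal interaction sketch for the Hanabi benchmark. It assumes DeepMind's hanabi_learning_environment package and its rl_env.make interface; the paper itself builds on the Hanabi implementation bundled with the off-belief-learning repository, so this is an illustration rather than the authors' exact setup.

```python
# Minimal Hanabi rollout sketch (assumes the hanabi_learning_environment
# package; the paper's experiments use the environment bundled with the
# off-belief-learning repository instead).
import random

from hanabi_learning_environment import rl_env

# Two-player Hanabi with the full rule set.
env = rl_env.make(environment_name="Hanabi-Full", num_players=2)

observations = env.reset()
done = False
episode_return = 0.0

while not done:
    current_player = observations["current_player"]
    player_obs = observations["player_observations"][current_player]
    # Act uniformly at random over the acting player's legal moves.
    action = random.choice(player_obs["legal_moves"])
    observations, reward, done, _ = env.step(action)
    episode_return += reward

print("episode return:", episode_return)
```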
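
The Experiment Setup row describes synchronous training in which all policy and belief models are trained simultaneously and, every n = 50 steps, a frozen copy of each model is saved and all dependent learners are pointed at the fresh copies. The sketch below illustrates only that checkpoint-and-refresh cadence under assumed names (Learner, train_step, synchronous_training are placeholders, not the off-belief-learning codebase's API); it omits the R2D2 machinery of parallel environment workers and a centralized replay buffer.

```python
# Sketch of the synchronous checkpoint-and-refresh cadence described in the
# Experiment Setup row. All classes and helpers here are hypothetical
# placeholders, not the off-belief-learning codebase's API.
import copy
from dataclasses import dataclass, field


@dataclass
class Learner:
    """A policy or belief model being trained (placeholder)."""
    name: str
    # Names of learners whose frozen copies this learner conditions on,
    # e.g. a level-k OBL policy depends on the level-(k-1) belief model.
    dependency_names: list = field(default_factory=list)
    frozen_dependencies: dict = field(default_factory=dict)
    step: int = 0

    def train_step(self):
        # Placeholder for one gradient step taken against the currently
        # frozen dependency copies.
        self.step += 1


def synchronous_training(learners, total_steps, refresh_every=50):
    """Train all models simultaneously; every `refresh_every` steps, save a
    frozen copy of each model and repoint every learner's dependencies."""
    for t in range(1, total_steps + 1):
        for learner in learners:
            learner.train_step()
        if t % refresh_every == 0:
            # Save a copy of every model, then query for and update all
            # dependencies, mirroring the n = 50 cadence quoted above.
            frozen = {l.name: copy.deepcopy(l) for l in learners}
            for learner in learners:
                learner.frozen_dependencies = {
                    name: frozen[name] for name in learner.dependency_names
                }


# Example: an OBL level-1 policy conditioned on a belief model, both trained
# synchronously (the names are illustrative only).
belief = Learner("belief_level0")
policy = Learner("policy_level1", dependency_names=["belief_level0"])
synchronous_training([belief, policy], total_steps=200, refresh_every=50)
```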