Off-Belief Learning

Authors: Hengyuan Hu, Adam Lerer, Brandon Cui, Luis Pineda, Noam Brown, Jakob Foerster

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then evaluate OBL in both a toy setting and Hanabi. In the toy setting, we demonstrate that OBL learns an optimal grounded policy while other existing methods such as SP and Cognitive Hierarchies do not. In Hanabi, OBL finds fully-grounded policies that reach a score of 20.92 in SP without relying on conventions, an important data point that tells us how well we can perform in this benchmark without conventions.
Researcher Affiliation | Industry | Facebook AI Research.
Pseudocode | No | The paper provides diagrams and textual descriptions of its algorithms (Figure 1), but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | We will open source our code and all models.
Open Datasets | No | The paper uses the Hanabi environment, a popular benchmark, and refers to 'human game data collected from an online board game platform' used to train the Clone Bot, but it does not provide concrete access information (link, DOI, repository, or citation) through which these datasets could be obtained.
Dataset Splits | No | Because this is a reinforcement learning paper, data are generated dynamically through interaction with the environment rather than drawn from a static dataset with predefined training, validation, and test splits; the paper does not report split percentages or sample counts.
Hardware Specification | No | The paper mentions running experiments 'efficiently on GPUs' but does not provide specific details on the GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or other hardware components used.
Software Dependencies | No | The paper names R2D2 as its backbone and lists the deep learning techniques it relies on (double-DQN, prioritized experience replay, the Adam optimizer), but it does not specify version numbers for any software libraries, programming languages (e.g., Python 3.x), or frameworks used in the implementation.
Experiment Setup | No | The paper describes the general training process, including parallel environments, Q-function approximation, and a replay buffer, and mentions a 'temperature hyperparameter T', but the main text gives no concrete values for learning rates, batch sizes, number of epochs, or detailed network architectures; it defers 'neural network design, hyper-parameters and computation cost' to Appendix B.
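
The Software Dependencies row notes that the paper builds on an R2D2 backbone with double-DQN, prioritized experience replay, and the Adam optimizer, without pinning library versions. Purely as a point of reference, the following is a minimal PyTorch-style sketch of the double-DQN bootstrap target that such a backbone typically computes; the function and argument names are illustrative and are not taken from the paper's code.

```python
import torch

def double_dqn_target(online_q, target_q, next_obs, reward, done, gamma=0.99):
    # Double-DQN: the online network selects the greedy next action,
    # while the target network evaluates it, reducing overestimation bias.
    with torch.no_grad():
        next_action = online_q(next_obs).argmax(dim=-1, keepdim=True)
        next_value = target_q(next_obs).gather(-1, next_action).squeeze(-1)
        # Zero out the bootstrap term on terminal transitions.
        return reward + gamma * (1.0 - done) * next_value
```

In an R2D2-style setup the same target is computed over recurrent value estimates unrolled on sequences drawn from a prioritized replay buffer.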
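The Experiment Setup row mentions a temperature hyperparameter T without a reported value. A common way such a temperature enters Q-learning agents is through a Boltzmann (softmax) policy over Q-values; the sketch below is a generic illustration under that assumption, not the paper's implementation.

```python
import torch

def boltzmann_action(q_values, temperature):
    # Softmax over Q-values: small T approaches the greedy policy,
    # large T approaches a uniform random policy.
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```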