Improving Policies via Search in Cooperative Partially Observable Games

Authors: Adam Lerer, Hengyuan Hu, Jakob Foerster, Noam Brown (pp. 7187-7194)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the benchmark challenge problem of Hanabi, our search technique greatly improves the performance of every agent we tested and when applied to a policy trained using RL achieves a new state-of-the-art score of 24.61 / 25 in the game, compared to a previous-best of 24.08 / 25.
Researcher Affiliation | Industry | Adam Lerer, Facebook AI Research, alerer@fb.com; Hengyuan Hu, Facebook AI Research, hengyuan@fb.com; Jakob Foerster, Facebook AI Research, jnf@fb.com; Noam Brown, Facebook AI Research, noambrown@fb.com
Pseudocode | No | A precise description of the algorithm is provided in this paper’s extended version. The provided paper text does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide code for single- and multi-agent search in Hanabi as well as a link to supplementary material at https://github.com/facebookresearch/Hanabi_SPARTA
Open Datasets | Yes | We evaluate our methods in the partially observable, fully cooperative game Hanabi, which at a high level resembles a cooperative extension of solitaire. Hanabi has recently been proposed as a new frontier for AI research (Bard et al. 2019).
Dataset Splits | No | The paper describes training an RL blueprint in a game environment ('train in self-play') rather than using static datasets with explicit train/validation/test splits. No specific dataset split information is provided.
Hardware Specification | Yes | All experiments except the imitation learning of Clone Bot and the reinforcement learning of RLBot were conducted on CPU using machines with Intel Xeon E5-2698 CPUs containing 40 cores each.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | After a minimum of 100 rollouts per action is performed... If the expected value for an action is not within 2 standard deviations of the expected value of the best action, its future MC rollouts are skipped. Furthermore, we use a configurable threshold for deviating from the blueprint action... We use a threshold of 0.05 in our experiments. (See the sketch after this table.)
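The Experiment Setup row quotes three concrete search settings: a minimum of 100 Monte Carlo rollouts per action, skipping further rollouts for actions whose estimated value falls more than 2 standard deviations below the best action, and a 0.05 threshold for deviating from the blueprint action. The Python sketch below illustrates how such a selection loop could fit together. It is not the SPARTA implementation; the rollout_return hook, its toy Gaussian returns, the rollout budget, and the use of the standard error of the mean as the pruning statistic are all assumptions made for illustration.

```python
# Minimal sketch of the rollout-based action selection described above.
# All environment/blueprint hooks are hypothetical stand-ins, not SPARTA code.
import math
import random

MIN_ROLLOUTS = 100          # minimum rollouts per action before pruning starts
PRUNE_STDS = 2.0            # prune actions > 2 deviations below the best
                            # (exact estimator is an assumption in this sketch)
DEVIATION_THRESHOLD = 0.05  # required EV gain over the blueprint action


def rollout_return(action: int, rng: random.Random) -> float:
    """Stand-in for one Monte Carlo rollout of the blueprint policy after
    taking `action` from a sampled hidden state (here: a toy Gaussian)."""
    return rng.gauss(20.0 + 0.1 * action, 3.0)


def choose_action(actions, blueprint, rollouts_per_action=1000, seed=0):
    rng = random.Random(seed)
    stats = {a: [0, 0.0, 0.0] for a in actions}  # count, sum, sum of squares

    def mean_and_stderr(a):
        n, s, sq = stats[a]
        m = s / n
        var = max(sq / n - m * m, 0.0)
        return m, math.sqrt(var / n)

    active = list(actions)
    for i in range(rollouts_per_action):
        for a in active:
            r = rollout_return(a, rng)
            stats[a][0] += 1
            stats[a][1] += r
            stats[a][2] += r * r
        # After the per-action minimum, skip future rollouts for actions whose
        # estimated value is not within PRUNE_STDS deviations of the best.
        if i + 1 >= MIN_ROLLOUTS:
            best_mean = max(mean_and_stderr(a)[0] for a in active)
            active = [a for a in active
                      if mean_and_stderr(a)[0] + PRUNE_STDS * mean_and_stderr(a)[1]
                      >= best_mean]

    best = max(actions, key=lambda a: mean_and_stderr(a)[0])
    # Deviate from the blueprint only if the estimated gain exceeds the threshold.
    if mean_and_stderr(best)[0] - mean_and_stderr(blueprint)[0] > DEVIATION_THRESHOLD:
        return best
    return blueprint


if __name__ == "__main__":
    print(choose_action(actions=list(range(5)), blueprint=2))
```

In this sketch the deviation threshold plays the role described in the quote: even when search finds a slightly better action, the agent keeps the blueprint action unless the estimated gain exceeds 0.05, which guards against deviating on noisy value estimates.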