Improving Policies via Search in Cooperative Partially Observable Games

Authors: Adam Lerer, Hengyuan Hu, Jakob Foerster, Noam Brown (pp. 7187-7194)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the benchmark challenge problem of Hanabi, our search technique greatly improves the performance of every agent we tested and when applied to a policy trained using RL achieves a new state-of-the-art score of 24.61 / 25 in the game, compared to a previous-best of 24.08 / 25.
Researcher Affiliation | Industry | Adam Lerer, Facebook AI Research, alerer@fb.com; Hengyuan Hu, Facebook AI Research, hengyuan@fb.com; Jakob Foerster, Facebook AI Research, jnf@fb.com; Noam Brown, Facebook AI Research, noambrown@fb.com
Pseudocode | No | A precise description of the algorithm is provided in this paper’s extended version. The provided paper text does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide code for single- and multi-agent search in Hanabi as well as a link to supplementary material at https://github.com/facebookresearch/Hanabi_SPARTA
Open Datasets | Yes | We evaluate our methods in the partially observable, fully cooperative game Hanabi, which at a high level resembles a cooperative extension of solitaire. Hanabi has recently been proposed as a new frontier for AI research (Bard et al. 2019).
Dataset Splits | No | The paper describes training an RL blueprint in a game environment ('train in self-play') rather than using static datasets with explicit train/validation/test splits. No specific dataset split information is provided.
Hardware Specification | Yes | All experiments except the imitation learning of Clone Bot and the reinforcement learning of RLBot were conducted on CPU using machines with Intel Xeon E5-2698 CPUs containing 40 cores each.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | After a minimum of 100 rollouts per action is performed... If the expected value for an action is not within 2 standard deviations of the expected value of the best action, its future MC rollouts are skipped. Furthermore, we use a configurable threshold for deviating from the blueprint action... We use a threshold of 0.05 in our experiments. (See the sketch after this table.)
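The Experiment Setup row quotes three concrete search settings: a minimum of 100 Monte Carlo rollouts per action, skipping further rollouts for actions whose estimated value falls more than 2 standard deviations below the best action, and a 0.05 threshold for deviating from the blueprint action. The Python sketch below illustrates how such a selection loop could fit together. It is not the SPARTA implementation; the rollout_return hook, its toy Gaussian returns, the rollout budget, and the use of the standard error of the mean as the pruning statistic are all assumptions made for illustration.

```python
# Minimal sketch of the rollout-based action selection described above.
# All environment/blueprint hooks are hypothetical stand-ins, not SPARTA code.
import math
import random

MIN_ROLLOUTS = 100          # minimum rollouts per action before pruning starts
PRUNE_STDS = 2.0            # prune actions > 2 deviations below the best
                            # (exact estimator is an assumption in this sketch)
DEVIATION_THRESHOLD = 0.05  # required EV gain over the blueprint action


def rollout_return(action: int, rng: random.Random) -> float:
    """Stand-in for one Monte Carlo rollout of the blueprint policy after
    taking `action` from a sampled hidden state (here: a toy Gaussian)."""
    return rng.gauss(20.0 + 0.1 * action, 3.0)


def choose_action(actions, blueprint, rollouts_per_action=1000, seed=0):
    rng = random.Random(seed)
    stats = {a: [0, 0.0, 0.0] for a in actions}  # count, sum, sum of squares

    def mean_and_stderr(a):
        n, s, sq = stats[a]
        m = s / n
        var = max(sq / n - m * m, 0.0)
        return m, math.sqrt(var / n)

    active = list(actions)
    for i in range(rollouts_per_action):
        for a in active:
            r = rollout_return(a, rng)
            stats[a][0] += 1
            stats[a][1] += r
            stats[a][2] += r * r
        # After the per-action minimum, skip future rollouts for actions whose
        # estimated value is not within PRUNE_STDS deviations of the best.
        if i + 1 >= MIN_ROLLOUTS:
            best_mean = max(mean_and_stderr(a)[0] for a in active)
            active = [a for a in active
                      if mean_and_stderr(a)[0] + PRUNE_STDS * mean_and_stderr(a)[1]
                      >= best_mean]

    best = max(actions, key=lambda a: mean_and_stderr(a)[0])
    # Deviate from the blueprint only if the estimated gain exceeds the threshold.
    if mean_and_stderr(best)[0] - mean_and_stderr(blueprint)[0] > DEVIATION_THRESHOLD:
        return best
    return blueprint


if __name__ == "__main__":
    print(choose_action(actions=list(range(5)), blueprint=2))
```

In this sketch the deviation threshold plays the role described in the quote: even when search finds a slightly better action, the agent keeps the blueprint action unless the estimated gain exceeds 0.05, which guards against deviating on noisy value estimates.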