Improving Policies via Search in Cooperative Partially Observable Games
Authors: Adam Lerer, Hengyuan Hu, Jakob Foerster, Noam Brown
AAAI 2020, pp. 7187-7194 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the benchmark challenge problem of Hanabi, our search technique greatly improves the performance of every agent we tested and when applied to a policy trained using RL achieves a new state-of-the-art score of 24.61 / 25 in the game, compared to a previous-best of 24.08 / 25. |
| Researcher Affiliation | Industry | Adam Lerer Facebook AI Research alerer@fb.com Hengyuan Hu Facebook AI Research hengyuan@fb.com Jakob Foerster Facebook AI Research jnf@fb.com Noam Brown Facebook AI Research noambrown@fb.com |
| Pseudocode | No | A precise description of the algorithm is provided in this paper’s extended version. The provided paper text does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code for single- and multi-agent search in Hanabi as well as a link to supplementary material at https://github.com/facebookresearch/Hanabi_SPARTA |
| Open Datasets | Yes | We evaluate our methods in the partially observable, fully cooperative game Hanabi, which at a high level resembles a cooperative extension of solitaire. Hanabi has recently been proposed as a new frontier for AI research (Bard et al. 2019) |
| Dataset Splits | No | The paper describes training an RL blueprint in a game environment ('train in self-play') rather than using static datasets with explicit train/validation/test splits. No specific dataset split information is provided. |
| Hardware Specification | Yes | All experiments except the imitation learning of Clone Bot and the reinforcement learning of RLBot were conducted on CPU using machines with Intel® Xeon® E5-2698 CPUs containing 40 cores each. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | After a minimum of 100 rollouts per action is performed... If the expected value for an action is not within 2 standard deviations of the expected value of the best action, its future MC rollouts are skipped. Furthermore, we use a configurable threshold for deviating from the blueprint action... We use a threshold of 0.05 in our experiments. (See the sketch below the table.) |
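
The Experiment Setup row quotes the paper's rollout budget (a minimum of 100 rollouts per action), its pruning rule (stop spending rollouts on actions whose expected value is not within 2 standard deviations of the best action's), and its 0.05 threshold for deviating from the blueprint. The following is a minimal sketch of such a per-action Monte Carlo search loop under stated assumptions, not the authors' SPARTA implementation: `rollout_value`, `legal_actions`, `blueprint_action`, and `max_rollouts` are hypothetical stand-ins, and the use of the standard error of the mean as the pruning statistic is an assumption about how "standard deviations" is measured.

```python
import math
from statistics import mean, stdev


def mc_search_action(legal_actions, blueprint_action, rollout_value,
                     min_rollouts=100, prune_sigma=2.0,
                     deviation_threshold=0.05, max_rollouts=10_000):
    """Pick an action via per-action Monte Carlo rollouts (sketch).

    * Every legal action first receives `min_rollouts` rollouts.
    * Actions whose estimated value falls more than `prune_sigma`
      standard errors below the current best stop receiving rollouts
      (assumed interpretation of the paper's pruning rule).
    * The search action replaces the blueprint action only if its
      estimated value beats the blueprint's by `deviation_threshold`.
    """
    returns = {a: [] for a in legal_actions}

    # Warm-up phase: a fixed minimum number of rollouts per action.
    for a in legal_actions:
        for _ in range(min_rollouts):
            returns[a].append(rollout_value(a))

    active = set(legal_actions)
    total = min_rollouts * len(legal_actions)
    while len(active) > 1 and total < max_rollouts:
        best = max(legal_actions, key=lambda a: mean(returns[a]))
        # Skip future rollouts for actions clearly worse than the best.
        for a in list(active):
            if a == best:
                continue
            se = stdev(returns[a]) / math.sqrt(len(returns[a]))
            if mean(returns[a]) + prune_sigma * se < mean(returns[best]):
                active.discard(a)
        # Spend further rollouts only on the surviving candidates.
        for a in active:
            returns[a].append(rollout_value(a))
            total += 1

    best = max(legal_actions, key=lambda a: mean(returns[a]))
    # Deviate from the blueprint only when the estimated gain
    # exceeds the configured threshold.
    if mean(returns[best]) - mean(returns[blueprint_action]) > deviation_threshold:
        return best
    return blueprint_action
```

In practice `rollout_value(a)` would sample a world state consistent with the searching agent's observations, play action `a`, and complete the game with all players following the blueprint policy; that sampling machinery is omitted here.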