Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Combining Deep Reinforcement Learning and Search for Imperfect-Information Games
Authors: Noam Brown, Anton Bakhtin, Adam Lerer, Qucheng Gong
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7 Experimental Setup; 8 Experimental Results. Figure 2 shows ReBeL reaches a level of exploitability in TEH equivalent to running about 125 iterations of full-game tabular CFR. Table 1 shows results for ReBeL in HUNL. |
| Researcher Affiliation | Industry | Facebook AI Research |
| Pseudocode | Yes | Algorithm 1 ReBeL: RL and Search for Imperfect-Information Games |
| Open Source Code | Yes | We also show ReBeL approximates a Nash equilibrium in Liar's Dice, another benchmark imperfect-information game, and open source our implementation of it. https://github.com/facebookresearch/rebel |
| Open Datasets | Yes | We evaluate on the benchmark imperfect-information games of heads-up no-limit Texas hold'em poker (HUNL) and Liar's Dice. The rules for both games are provided in Appendix C. |
| Dataset Splits | No | The paper describes a self-play reinforcement learning approach within game environments but does not provide specific training, validation, or test dataset splits with percentages or sample counts. |
| Hardware Specification | No | For this reason we use a single machine for training and up to 128 machines with 8 GPUs each for data generation. |
| Software Dependencies | No | We use PyTorch [46] to train the networks. |
| Experiment Setup | Yes | We use pointwise Huber loss as the criterion for the value function and mean squared error (MSE) over probabilities for the policy. In preliminary experiments we found MSE for the value network and cross entropy for the policy network did worse. See Appendix E for the hyperparameters. |
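The loss choices quoted in the Experiment Setup row (pointwise Huber loss for the value function, MSE over probabilities for the policy) can be illustrated with a minimal plain-Python sketch. This is a hedged illustration only: the function names `huber` and `mse` and the scalar/list signatures are ours, not from the paper's codebase, which trains tensor-valued networks with PyTorch equivalents (e.g. smooth L1 loss).

```python
def huber(pred, target, delta=1.0):
    # Pointwise Huber loss: quadratic near zero, linear in the tails.
    # `delta` controls where the loss switches from quadratic to linear.
    d = abs(pred - target)
    if d <= delta:
        return 0.5 * d * d
    return delta * (d - 0.5 * delta)

def mse(pred_probs, target_probs):
    # Mean squared error over a vector of action probabilities,
    # as used for the policy network.
    return sum((p - q) ** 2 for p, q in zip(pred_probs, target_probs)) / len(pred_probs)

# Small residual -> quadratic regime; large residual -> linear regime.
print(huber(0.5, 0.0))              # 0.125
print(huber(2.0, 0.0))              # 1.5
print(mse([0.6, 0.4], [0.5, 0.5])) # 0.01
```

The Huber loss is less sensitive to outlier value targets than plain MSE, which is consistent with the paper's report that MSE for the value network did worse in preliminary experiments.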