Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

Authors: Noam Brown, Anton Bakhtin, Adam Lerer, Qucheng Gong

NeurIPS 2020 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 7 Experimental Setup; 8 Experimental Results. Figure 2 shows ReBeL reaches a level of exploitability in TEH equivalent to running about 125 iterations of full-game tabular CFR. Table 1 shows results for ReBeL in HUNL. |
| Researcher Affiliation | Industry | Facebook AI Research EMAIL |
| Pseudocode | Yes | Algorithm 1 ReBeL: RL and Search for Imperfect-Information Games |
| Open Source Code | Yes | We also show ReBeL approximates a Nash equilibrium in Liar's Dice, another benchmark imperfect-information game, and open source our implementation of it. https://github.com/facebookresearch/rebel |
| Open Datasets | Yes | We evaluate on the benchmark imperfect-information games of heads-up no-limit Texas hold'em poker (HUNL) and Liar's Dice. The rules for both games are provided in Appendix C. |
| Dataset Splits | No | The paper describes a self-play reinforcement learning approach within game environments but does not provide specific training, validation, or test dataset splits with percentages or sample counts. |
| Hardware Specification | No | For this reason we use a single machine for training and up to 128 machines with 8 GPUs each for data generation. |
| Software Dependencies | No | We use PyTorch [46] to train the networks. |
| Experiment Setup | Yes | We use pointwise Huber loss as the criterion for the value function and mean squared error (MSE) over probabilities for the policy. In preliminary experiments we found MSE for the value network and cross entropy for the policy network did worse. See Appendix E for the hyperparameters. |
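The loss setup quoted above (pointwise Huber loss for the value head, MSE over probabilities for the policy head) can be sketched in PyTorch. This is a minimal illustration, not the ReBeL training code: the tensor shapes, the random targets, and the unweighted sum of the two losses are assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical batch of predictions and targets; shapes are illustrative
# and not taken from the ReBeL implementation.
batch, num_actions = 4, 3
value_pred = torch.randn(batch, 1)
value_target = torch.randn(batch, 1)
policy_pred = torch.softmax(torch.randn(batch, num_actions), dim=-1)
policy_target = torch.softmax(torch.randn(batch, num_actions), dim=-1)

# Pointwise Huber loss for the value function; MSE over action
# probabilities for the policy, as described in the paper.
value_criterion = nn.HuberLoss()
policy_criterion = nn.MSELoss()

value_loss = value_criterion(value_pred, value_target)
policy_loss = policy_criterion(policy_pred, policy_target)

# Combined objective (equal weighting is an assumption here).
loss = value_loss + policy_loss
```

The paper notes that the reverse pairing (MSE for the value network, cross entropy for the policy network) did worse in preliminary experiments; hyperparameters are in its Appendix E.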