Double Neural Counterfactual Regret Minimization

Authors: Hui Li, Kailiang Hu, Shaohua Zhang, Yuan Qi, Le Song

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To understand the contributions of the various components of the DNCFR algorithm, we first conduct a set of ablation studies. We then compare DNCFR with tabular CFR and with deep reinforcement learning methods such as NFSP, a prior leading function-approximation method in IIGs. Finally, we conduct experiments on heads-up no-limit Texas Hold'em (HUNL) to show the scalability of the DNCFR algorithm.
Researcher Affiliation | Collaboration | Ant Financial Services Group and Georgia Institute of Technology. Emails: {ken.lh, hkl163251, yaohua.zsh, yuan.qi, le.song}@antfin.com; lsong@cc.gatech.edu
Pseudocode | Yes | Algorithm 1: CFR Algorithm; Algorithm 2: DNCFR Algorithm; Algorithm 3: Optimization of Deep Neural Network; Algorithm 4: Mini-Batch RS-MCCFR with Double Neural Networks
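The regret-matching step at the core of CFR (Algorithm 1), which the paper's neural networks learn to approximate, can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the regret values below are hypothetical.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Map cumulative counterfactual regrets at an infoset to a strategy.

    Actions with positive cumulative regret receive probability
    proportional to that regret; if no action has positive regret,
    the strategy falls back to uniform.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))

# Hypothetical cumulative regrets for three actions at one infoset.
regrets = np.array([2.0, -1.0, 3.0])
strategy = regret_matching(regrets)  # -> [0.4, 0.0, 0.6]
```

In DNCFR the cumulative regrets are not stored in a table but predicted by the RSN network; the mapping from regrets to a strategy is the same.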
Open Source Code | No | The paper states that it reimplemented DeepStack due to the lack of available open-source code for such solvers, and mentions example code for Leduc Hold'em (https://github.com/lifrordi/DeepStack-Leduc), which is not their own DNCFR code. It does not provide a link to, or statement about, open-sourcing the DNCFR code described in the paper.
Open Datasets | Yes | We perform the ablation studies on Leduc Hold'em poker, which is a commonly used poker game in the research community (Heinrich & Silver, 2016; Schmid et al., 2018; Steinberger, 2019; Lockhart et al., 2019). In our experiments, we test DNCFR on three Leduc Hold'em instances with stack sizes 5, 10, and 15, denoted by Leduc(5), Leduc(10), and Leduc(15) respectively. To test DNCFR's scalability, we develop a neural agent to solve HUNL...
Dataset Splits | No | The paper describes sampling infosets within each iteration to train the neural networks, but it does not specify fixed train/validation/test splits in the conventional sense for the DNCFR methodology. A 'validation sample' is mentioned in the context of the authors' DeepStack reimplementation, but the data partitioning for DNCFR's own experiments is not described.
Hardware Specification | Yes | To optimize the counterfactual value network on the turn subgame (this subgame looks ahead two rounds and contains both turn and river), we generate nine million samples. Because each sample is generated by traversing 1000 iterations of the CFR+ algorithm from a random reach probability, these samples are computationally expensive: they cost a 1500-node cluster (each node with 32 CPU cores and 60 GB of memory) more than 60 days.
Software Dependencies | No | The paper mentions using the Adam optimizer and an LSTM with attention, but it does not specify version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup | Yes | In experiments, we set the network hyperparameters as follows.

Hyperparameters on Leduc Hold'em. The Leduc(5), Leduc(10), and Leduc(15) instances in our experiments have 1.1 × 10^4 infosets (6 × 10^4 states), 3 × 10^5 infosets (1.5 × 10^6 states), and 3 × 10^6 infosets (2 × 10^7 states) respectively. We set k = 3 as the default parameter of the provable robust sampling method on all such games. For the small Leduc(5), we select b = 100 as the default mini-batch MCCFR parameter, which samples only 5.59% of infosets in each iteration. For the larger Leduc(10) and Leduc(15), we select b = 500 by default, which visits (observes) only 2.39% and 0.53% of infosets in each iteration.

To train RSN and ASN, we set the default embedding size for both neural networks to 16, 32, and 64 for Leduc(5), Leduc(10), and Leduc(15) respectively. 256 samples are used to update the parameter gradients by the mini-batch stochastic gradient descent technique. We select Adam (Kingma & Ba, 2014) as the default optimizer and an LSTM with attention as the default neural architecture in all experiments. The neural networks have only 2608, 7424, and 23360 parameters respectively, far fewer than the number of infosets.

The default Adam learning rate is β_lr = 0.001. A scheduler, which reduces the learning rate based on the number of epochs and the convergence rate of the loss, helps the neural agent reach high accuracy: the learning rate is reduced by a factor of 0.5 when the loss has stopped improving for 10 epochs, with a lower bound of 10^-6 on the learning rate of all parameters. To avoid converging to potential local minima or saddle points, we reset the learning rate to 0.001 to help the optimizer obtain better performance. θ^T_best denotes the best parameters, achieving the lowest loss after T epochs. If the average loss for epoch t is less than the specified criterion β_loss = 10^-4 for RSN, we early-stop the optimizer. We set β_epoch = 2000 and update the optimizer for at most 2000 epochs. For ASN, we set the early-stopping loss criterion to 10^-5 and reduce the learning rate by a factor of 0.7 when the loss has stopped improving for 15 epochs.
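The training loop described above (reduce-on-plateau learning-rate schedule with a floor, plus early stopping on a loss criterion) can be sketched framework-independently. This is a minimal sketch under the stated hyperparameters; the class and function names are illustrative, not from the paper's code, and the plateau-patience semantics are one reasonable reading of the description.

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` when the loss has not improved
    for more than `patience` epochs, never going below `min_lr`.
    Defaults mirror the RSN schedule described in the paper (illustrative)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=10, min_lr=1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best = float("inf")   # best loss seen so far
        self.stale = 0             # epochs without improvement

    def step(self, loss):
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
            if self.stale > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale = 0
        return self.lr

def train(epoch_losses, loss_criterion=1e-4, max_epochs=2000):
    """Run at most `max_epochs` epochs, early-stopping once the epoch loss
    drops below `loss_criterion` (10^-4 for RSN, 10^-5 for ASN)."""
    sched = PlateauScheduler()
    for epoch, loss in enumerate(epoch_losses[:max_epochs]):
        lr = sched.step(loss)
        if loss < loss_criterion:
            return epoch, lr  # early stop
    return max_epochs, sched.lr
```

For example, a loss that stays flat for 12 epochs triggers one halving of the learning rate (0.001 → 0.0005) before the loss criterion ends training.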