DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning

Authors: Daochen Zha, Jingru Xie, Wenye Ma, Sheng Zhang, Xiangru Lian, Xia Hu, Ji Liu

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments are designed to answer the following research questions. RQ1: How does DouZero compare with existing DouDizhu programs, such as rule-based strategies, supervised learning, RL-based methods, and MCTS-based solutions (Section 5.2)? RQ2: How will DouZero perform if we consider the bidding phase (Section 5.3)? RQ3: How efficient is the training of DouZero (Section 5.4)? RQ4: How does DouZero compare with bootstrapping and actor-critic methods (Section 5.5)? RQ5: Do the learned card-playing strategies of DouZero align with human knowledge (Section 5.6)? RQ6: Is DouZero computationally efficient in inference compared with existing programs (Section 5.7)? RQ7: Can the two Peasants of DouZero learn to cooperate with each other (Section 5.8)?
Researcher Affiliation | Collaboration | 1) Department of Computer Science and Engineering, Texas A&M University; 2) AI Platform, Kwai Inc.; 3) Georgia Institute of Technology. Correspondence to: Daochen Zha <daochen.zha@tamu.edu>.
Pseudocode | Yes | Algorithm 1: Actor Process of DouZero; Algorithm 2: Learner Process of DouZero. (An illustrative sketch of this actor/learner loop is given below the table.)
Open Source Code | Yes | The code and an online demo are released with the hope that this insight could motivate future work: https://github.com/kwai/DouZero
Open Datasets | No | We internally collect 226,230 human expert matches from players of the highest league level in our DouDizhu mobile game app. Then we use the same state representation and neural architecture as DouZero to train supervised agents with 49,990,075 samples generated from these data. The paper describes using an internally collected dataset for supervised learning, but does not state that this dataset is publicly available or provide access information for it.
Dataset Splits | No | The paper does not provide explicit training/validation/test splits (e.g., percentages or sample counts) needed for reproduction, either for the primary DouZero training or for the supervised-learning baseline, beyond stating the number of samples generated for the latter.
Hardware Specification | Yes | We run all the experiments on a single server with 48 processors of Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz and four 1080 Ti GPUs.
Software Dependencies | No | Our implementation is based on the TorchBeast framework (Küttler et al., 2019). The paper mentions the TorchBeast framework but does not provide specific version numbers for it or for other key software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Each shared buffer has B = 50 entries with size S = 100, batch size M = 32, and ϵ = 0.01. We set the discount factor γ = 1 since DouDizhu only has a nonzero reward at the last timestep and early moves are very important. We use ReLU as the activation function for each layer of the MLP. We adopt the RMSprop optimizer with a learning rate ψ = 0.0001, smoothing constant 0.99, and ϵ = 10⁻⁵. We train DouZero for 30 days. (The second sketch below the table shows how these optimizer settings map onto code.)
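The pseudocode row above refers to DouZero's actor and learner processes, which implement Deep Monte-Carlo with ϵ-greedy exploration. The following is a minimal, hedged sketch of that structure; the encoding dimensions, layer sizes, function names, and the toy batch are illustrative assumptions, not the released DouZero implementation.

```python
# A minimal sketch of DouZero's actor/learner structure (Deep Monte-Carlo with
# epsilon-greedy actors feeding a central learner). Dimensions, layer sizes,
# and names are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 120, 54  # assumed encoding sizes for illustration only


class QNetwork(nn.Module):
    """Scores a (state, action) pair with an MLP, i.e. Q(s, a)."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, state, action):
        return self.mlp(torch.cat([state, action], dim=-1)).squeeze(-1)


def actor_step(net, state, legal_actions, epsilon=0.01):
    """Actor (cf. Algorithm 1): epsilon-greedy choice among the legal actions."""
    if torch.rand(1).item() < epsilon:
        return legal_actions[torch.randint(len(legal_actions), (1,)).item()]
    with torch.no_grad():
        q = net(state.expand(len(legal_actions), -1), torch.stack(legal_actions))
    return legal_actions[int(q.argmax())]


def learner_step(net, optimizer, states, actions, returns):
    """Learner (cf. Algorithm 2): regress Q(s, a) toward Monte-Carlo returns."""
    loss = nn.functional.mse_loss(net(states, actions), returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    net = QNetwork()
    opt = torch.optim.RMSprop(net.parameters(), lr=1e-4)
    # One actor decision on fake legal actions.
    legal = [torch.rand(ACTION_DIM) for _ in range(5)]
    _ = actor_step(net, torch.rand(STATE_DIM), legal)
    # One learner update on a fake batch standing in for the shared buffers.
    states, actions = torch.rand(32, STATE_DIM), torch.rand(32, ACTION_DIM)
    returns = torch.randint(-2, 3, (32,)).float()  # reward only at the final timestep
    print("loss:", learner_step(net, opt, states, actions, returns))
```

In this sketch the actor scores every legal move with the shared Q-network and plays the best one except for an ϵ fraction of random exploration, while the learner regresses Q(s, a) toward the undiscounted Monte-Carlo return of each finished game.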
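The experiment-setup row reports the hyperparameters verbatim. As a second illustration, the sketch below maps them onto PyTorch's RMSprop optimizer; the placeholder model and constant names are assumptions, while the numeric values are those reported in the paper.

```python
# Hedged mapping of the reported hyperparameters onto PyTorch's RMSprop; the
# model below is a placeholder, only the numeric values come from the paper.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-4,     # learning rate psi = 0.0001
    alpha=0.99,  # smoothing constant
    eps=1e-5,    # epsilon term of RMSprop
)

# Remaining reported settings, kept as constants for reference.
BUFFER_ENTRIES = 50         # B: entries per shared buffer
ENTRY_SIZE = 100            # S: transitions per entry
BATCH_SIZE = 32             # M
EXPLORATION_EPSILON = 0.01  # epsilon-greedy exploration rate
DISCOUNT_GAMMA = 1.0        # undiscounted; reward arrives only at the last timestep
```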