Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
Authors: Haobo Fu, Weiming Liu, Shuang Wu, Yijia Wang, Tao Yang, Kai Li, Junliang Xing, Bin Li, Bo Ma, Qiang Fu, Yang Wei
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the proposed 1-on-1 Mahjong benchmark and benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods. |
| Researcher Affiliation | Collaboration | (1) Tencent AI Lab, Shenzhen, China; (2) University of Science and Technology of China, Hefei, China; (3) Peking University, Beijing, China; (4) Institute of Automation, Chinese Academy of Sciences, Beijing, China; (5) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China |
| Pseudocode | Yes | The pseudocode of NW-CFR is given in Algorithm 1. ... The pseudocode of ACH is given in Algorithm 2. |
| Open Source Code | Yes | The code of the 1-on-1 Mahjong benchmark is available at https://github.com/yata0/Mahjong. The code of ACH is available at https://github.com/Liuweiming/ACH_poker. |
| Open Datasets | Yes | To facilitate research on large-scale 2-player zero-sum IIGs, we propose a 1-on-1 Mahjong benchmark. ... The code of the 1-on-1 Mahjong benchmark is available at https://github.com/yata0/Mahjong. ... FHP is a simplified Heads-up Limit Texas Hold'em (HULH)... Additional results on smaller benchmarks from OpenSpiel (Lanctot et al., 2019) are given in the Appendix G. |
| Dataset Splits | No | The paper describes training and evaluation but does not provide specific percentages or counts for training, validation, and test splits. |
| Hardware Specification | Yes | All methods run in an asynchronous training platform with overall 800 CPUs, 3200 GB memory, and 8 M40 GPUs in the Ubuntu 16.04 operating system. |
| Software Dependencies | No | The paper mentions 'Ubuntu 16.04 operating system' but does not provide specific version numbers for other key software components or libraries. |
| Experiment Setup | Yes | We performed a mild hyper-parameter search on PPO and shared the best setting for all methods. The advantage value is estimated by the Generalized Advantage Estimator (GAE(λ)) (Schulman et al., 2016) for all methods. An overview of the hyper-parameters is listed in the Appendix H.1. ... Table 5 gives an overview of hyper-parameters for each method. |
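
The Experiment Setup row notes that all methods estimate advantages with GAE(λ). As a rough illustration of that estimator, the following is a minimal sketch of the standard recursion from Schulman et al. (2016), not the authors' implementation. The function name `gae_advantages`, its arguments, and the default `gamma` and `lam` values are assumptions chosen for this example; the settings actually used are the ones the paper lists in its Table 5.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,  # illustrative default, not taken from the paper
    lam: float = 0.95,    # illustrative default, not taken from the paper
) -> List[float]:
    """Compute GAE(lambda) advantages for one trajectory segment.

    rewards[t] and values[t] are the reward and the critic's value estimate
    at step t; bootstrap_value is the value estimate for the state after the
    last step (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    # A three-step episode with a single terminal reward and a zero critic.
    print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5, lam=1.0))
```

With `lam=1.0` the estimator reduces to the discounted return minus the value baseline, and with `lam=0.0` it reduces to the one-step TD residual, which is the bias-variance trade-off that λ controls.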