Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game

Authors: Haobo Fu, Weiming Liu, Shuang Wu, Yijia Wang, Tao Yang, Kai Li, Junliang Xing, Bin Li, Bo Ma, Qiang Fu, Yang Wei

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on the proposed 1-on-1 Mahjong benchmark and benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods."
Researcher Affiliation | Collaboration | 1. Tencent AI Lab, Shenzhen, China; 2. University of Science and Technology of China, Hefei, China; 3. Peking University, Beijing, China; 4. Institute of Automation, Chinese Academy of Sciences, Beijing, China; 5. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Pseudocode | Yes | "The pseudocode of NW-CFR is given in Algorithm 1. ... The pseudocode of ACH is given in Algorithm 2." (A sketch of the Hedge update underlying both algorithms appears after this table.)
Open Source Code | Yes | "The code of the 1-on-1 Mahjong benchmark is available at https://github.com/yata0/Mahjong. The code of ACH is available at https://github.com/Liuweiming/ACH_poker."
Open Datasets | Yes | "To facilitate research on large-scale 2-player zero-sum IIGs, we propose a 1-on-1 Mahjong benchmark. ... The code of the 1-on-1 Mahjong benchmark is available at https://github.com/yata0/Mahjong. ... FHP is a simplified Heads-Up Limit Texas Hold'em (HULH) ... Additional results on smaller benchmarks from OpenSpiel (Lanctot et al., 2019) are given in Appendix G." (A minimal OpenSpiel loading sketch follows this table.)
Dataset Splits | No | The paper describes training and evaluation but does not provide specific percentages or counts for training, validation, and test splits.
Hardware Specification | Yes | "All methods run in an asynchronous training platform with overall 800 CPUs, 3200 GB memory, and 8 M40 GPUs in the Ubuntu 16.04 operating system."
Software Dependencies | No | The paper mentions the Ubuntu 16.04 operating system but does not provide specific version numbers for other key software components or libraries.
Experiment Setup | Yes | "We performed a mild hyper-parameter search on PPO and shared the best setting for all methods. The advantage value is estimated by the Generalized Advantage Estimator (GAE(λ)) (Schulman et al., 2016) for all methods. An overview of the hyper-parameters is listed in Appendix H.1. ... Table 5 gives an overview of hyper-parameters for each method." (A reference GAE(λ) implementation follows this table.)
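The Pseudocode row refers to Algorithms 1-2 of the paper (NW-CFR and ACH), which are neural, clipped variants of the classic Hedge (multiplicative-weights) update. As a point of reference only, a minimal tabular sketch of that underlying update is given below; the fixed learning rate `eta`, the single-information-set setting, and the name `hedge_update` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hedge_update(logits, advantages, eta=0.1):
    """One tabular Hedge (multiplicative-weights) step for one information set.

    Hedge accumulates a scaled value signal in logit space and plays the
    softmax of the running sum. This is only the classical update that
    NW-CFR/ACH generalize; `eta` is an assumed fixed learning rate.
    """
    logits = logits + eta * advantages   # accumulate the advantage signal
    shifted = logits - logits.max()      # shift for a numerically stable softmax
    policy = np.exp(shifted)
    return logits, policy / policy.sum()

# Example: three actions, starting from a uniform policy.
logits = np.zeros(3)
logits, policy = hedge_update(logits, np.array([1.0, 0.0, -1.0]))
print(policy)  # probability mass shifts toward the high-advantage action
```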
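The smaller benchmarks mentioned under Open Datasets come from OpenSpiel (Lanctot et al., 2019). A minimal sketch of loading such a game and rolling it out through the standard pyspiel Python API follows; the game id "leduc_poker" and the uniform-random policy are illustrative choices, not details quoted from the paper.

```python
import numpy as np
import pyspiel  # pip install open_spiel

# "leduc_poker" is an illustrative OpenSpiel game id, not one quoted above.
game = pyspiel.load_game("leduc_poker")
state = game.new_initial_state()
while not state.is_terminal():
    if state.is_chance_node():
        # Chance nodes (card deals) expose an explicit outcome distribution.
        actions, probs = zip(*state.chance_outcomes())
        state.apply_action(int(np.random.choice(actions, p=probs)))
    else:
        # Uniform-random play stands in for a learned policy here.
        state.apply_action(int(np.random.choice(state.legal_actions())))
print(state.returns())  # zero-sum terminal payoffs, one per player
```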
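The Experiment Setup row quotes the use of GAE(λ) (Schulman et al., 2016) for advantage estimation in all methods. A standard single-trajectory reference implementation is sketched below; the γ and λ defaults are common values, not the paper's tuned hyper-parameters (those are listed in its Appendix H.1 and Table 5).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    rewards: r_0 .. r_{T-1} for one un-truncated trajectory
    values:  V(s_0) .. V(s_T), length T + 1 (last entry bootstraps the tail)
    gamma, lam: common defaults, not the paper's tuned settings
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```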