Policy Space Diversity for Non-Transitive Games
Authors: Jian Yao, Weiming Liu, Haobo Fu, Yaodong Yang, Stephen McAleer, Qiang Fu, Wei Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective than state-of-the-art PSRO variants at producing significantly less exploitable policies. The experiment code is available at https://github.com/nigelyaoj/policy-space-diversity-psro. |
| Researcher Affiliation | Collaboration | Jian Yao¹, Weiming Liu¹, Haobo Fu¹, Yaodong Yang², Stephen McAleer³, Qiang Fu¹, Wei Yang¹; ¹Tencent AI Lab, Shenzhen, China; ²Peking University, Beijing, China; ³Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 (PSD-PSRO) is provided in Appendix C. A minimal illustrative sketch of a diversity-regularized PSRO loop follows this table. |
| Open Source Code | Yes | The experiment code is available at https://github.com/nigelyaoj/policy-space-diversity-psro. |
| Open Datasets | Yes | The benchmarks consist of single-state games (AlphaStar888 and the non-transitive mixture game) and complex extensive-form games (Leduc poker and Goofspiel). AlphaStar888 is an empirical game generated from the process of solving StarCraft II [50]. Leduc poker is a simplified form of poker [46]. Goofspiel is commonly used as a large-scale multi-stage simultaneous-move game. |
| Dataset Splits | No | The paper describes training hyperparameters and reports exploitability and win rates, but it does not explicitly mention standard training, validation, or test dataset splits (e.g., percentages or specific counts) for reproducing data partitioning. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | Table 6 'Hyperparameters for Leduc poker' mentions 'Oracle agent PPO' and 'Optimizer Adam', and for Goofspiel 'DQN as the Oracle agent'. However, it does not provide specific version numbers for these software components or the programming language used for implementation. |
| Experiment Setup | Yes | Table 6 (Hyperparameters for Leduc poker) lists specific values: learning rate 3e-4, discount factor (γ) 0.99, clip 0.2, max gradient norm 0.05, 2e4 episodes per best-response training, and diversity weight λ = 0.1. The policy network is an MLP (state_dim-256-256-256-action_dim) with ReLU activations. These values are transcribed into a config sketch after this table. |
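
The pseudocode row above points to Algorithm 1 in Appendix C of the paper, which this report does not reproduce. As a reading aid, the sketch below shows a minimal, self-contained PSRO-style loop with a diversity-regularized best-response oracle, run on rock-paper-scissors. It is not the paper's Algorithm 1: the fictitious-play meta-solver, the random-search oracle, and the centroid-distance diversity bonus are simplified stand-ins for the paper's Nash meta-solver, RL-trained best response, and policy-space diversity term.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (antisymmetric,
# zero-sum). Used only to make this sketch self-contained.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def meta_nash(payoffs, steps=2000):
    """Approximate the meta-game Nash mixture via fictitious play."""
    counts = np.ones(payoffs.shape[0])
    for _ in range(steps):
        mix = counts / counts.sum()
        counts[np.argmax(payoffs @ mix)] += 1.0
    return counts / counts.sum()

def diverse_best_response(population, mix, lam=0.1, n_candidates=200):
    """Random-search stand-in for an RL oracle: maximize expected payoff
    against the meta-Nash mixture plus lam times a simplified diversity
    bonus (distance from the population centroid)."""
    opponent = mix @ np.vstack(population)   # aggregate opponent strategy
    centroid = np.mean(population, axis=0)
    best_p, best_val = None, -np.inf
    for _ in range(n_candidates):
        p = np.random.dirichlet(np.ones(3))  # random candidate mixed strategy
        val = p @ A @ opponent + lam * np.linalg.norm(p - centroid)
        if val > best_val:
            best_p, best_val = p, val
    return best_p

# Outer PSRO-style loop: evaluate the population, solve the meta-game,
# then expand the population with a diversity-regularized best response.
population = [np.random.dirichlet(np.ones(3))]
for _ in range(20):
    payoffs = np.array([[p @ A @ q for q in population] for p in population])
    mix = meta_nash(payoffs)
    population.append(diverse_best_response(population, mix, lam=0.1))

# Exploitability of the final mixture: the best response's payoff against
# the aggregate strategy (zero at an exact Nash equilibrium of RPS).
payoffs = np.array([[p @ A @ q for q in population] for p in population])
mix = meta_nash(payoffs)
aggregate = mix @ np.vstack(population)
print("exploitability:", np.max(A @ aggregate))
```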
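
The Table 6 values quoted in the Experiment Setup row translate directly into a configuration block. The sketch below is a transcription under the assumption of a PPO oracle trained with Adam (both named in the paper); the key names themselves are chosen for illustration and may not match the identifiers used in the released repository.

```python
# Leduc poker hyperparameters transcribed from Table 6 of the paper.
# Key names are illustrative; the released code may organize them differently.
LEDUC_PPO_CONFIG = {
    "oracle_agent": "PPO",
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_factor": 0.99,             # γ
    "clip": 0.2,                         # PPO clipping parameter
    "max_grad_norm": 0.05,
    "episodes_per_br_training": int(2e4),
    "diversity_weight": 0.1,             # λ in the PSD objective
    "policy_network": {
        "type": "MLP",                   # state_dim-256-256-256-action_dim
        "hidden_layers": [256, 256, 256],
        "activation": "ReLU",
    },
}
```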