Policy Space Diversity for Non-Transitive Games
Authors: Jian Yao, Weiming Liu, Haobo Fu, Yaodong Yang, Stephen McAleer, Qiang Fu, Wei Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective than state-of-the-art PSRO variants at producing significantly less exploitable policies. The experiment code is available at https://github.com/nigelyaoj/policy-space-diversity-psro. |
| Researcher Affiliation | Collaboration | Jian Yao¹, Weiming Liu¹, Haobo Fu¹, Yaodong Yang², Stephen McAleer³, Qiang Fu¹, Wei Yang¹; ¹Tencent AI Lab, Shenzhen, China; ²Peking University, Beijing, China; ³Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 (PSD-PSRO) is provided in Appendix C. A minimal illustrative sketch of a diversity-regularized PSRO loop follows this table. |
| Open Source Code | Yes | The experiment code is available at https://github.com/nigelyaoj/policy-space-diversity-psro. |
| Open Datasets | Yes | The benchmarks consist of single-state games (AlphaStar888 and the non-transitive mixture game) and complex extensive-form games (Leduc poker and Goofspiel). AlphaStar888 is an empirical game generated from the process of solving StarCraft II [50]. Leduc poker is a simplified form of poker [46]. Goofspiel is commonly used as a large-scale multi-stage simultaneous-move game. |
| Dataset Splits | No | The paper describes training hyperparameters and reports exploitability and win rates, but it does not explicitly mention standard training, validation, or test dataset splits (e.g., percentages or specific counts) for reproducing data partitioning. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | Table 6 'Hyperparameters for Leduc poker' mentions 'Oracle agent PPO' and 'Optimizer Adam', and for Goofspiel 'DQN as the Oracle agent'. However, it does not provide specific version numbers for these software components or the programming language used for implementation. |
| Experiment Setup | Yes | Table 6 (Hyperparameters for Leduc poker) lists specific values: learning rate 3e-4, discount factor (γ) 0.99, clip 0.2, max gradient norm 0.05, 2e4 episodes per best-response training, and diversity weight λ = 0.1. The policy network is an MLP (state_dim-256-256-256-action_dim) with ReLU activations. These values are transcribed into a config sketch after this table. |
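
The pseudocode row above points to Algorithm 1 in Appendix C of the paper, which this report does not reproduce. As a reading aid, the sketch below shows a minimal, self-contained PSRO-style loop with a diversity-regularized best-response oracle, run on rock-paper-scissors. It is not the paper's Algorithm 1: the fictitious-play meta-solver, the random-search oracle, and the centroid-distance diversity bonus are simplified stand-ins for the paper's Nash meta-solver, RL-trained best response, and policy-space diversity term.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (antisymmetric,
# zero-sum). Used only to make this sketch self-contained.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def meta_nash(payoffs, steps=2000):
    """Approximate the meta-game Nash mixture via fictitious play."""
    counts = np.ones(payoffs.shape[0])
    for _ in range(steps):
        mix = counts / counts.sum()
        counts[np.argmax(payoffs @ mix)] += 1.0
    return counts / counts.sum()

def diverse_best_response(population, mix, lam=0.1, n_candidates=200):
    """Random-search stand-in for an RL oracle: maximize expected payoff
    against the meta-Nash mixture plus lam times a simplified diversity
    bonus (distance from the population centroid)."""
    opponent = mix @ np.vstack(population)   # aggregate opponent strategy
    centroid = np.mean(population, axis=0)
    best_p, best_val = None, -np.inf
    for _ in range(n_candidates):
        p = np.random.dirichlet(np.ones(3))  # random candidate mixed strategy
        val = p @ A @ opponent + lam * np.linalg.norm(p - centroid)
        if val > best_val:
            best_p, best_val = p, val
    return best_p

# Outer PSRO-style loop: evaluate the population, solve the meta-game,
# then expand the population with a diversity-regularized best response.
population = [np.random.dirichlet(np.ones(3))]
for _ in range(20):
    payoffs = np.array([[p @ A @ q for q in population] for p in population])
    mix = meta_nash(payoffs)
    population.append(diverse_best_response(population, mix, lam=0.1))

# Exploitability of the final mixture: the best response's payoff against
# the aggregate strategy (zero at an exact Nash equilibrium of RPS).
payoffs = np.array([[p @ A @ q for q in population] for p in population])
mix = meta_nash(payoffs)
aggregate = mix @ np.vstack(population)
print("exploitability:", np.max(A @ aggregate))
```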
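
The Table 6 values quoted in the Experiment Setup row translate directly into a configuration block. The sketch below is a transcription under the assumption of a PPO oracle trained with Adam (both named in the paper); the key names themselves are chosen for illustration and may not match the identifiers used in the released repository.

```python
# Leduc poker hyperparameters transcribed from Table 6 of the paper.
# Key names are illustrative; the released code may organize them differently.
LEDUC_PPO_CONFIG = {
    "oracle_agent": "PPO",
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_factor": 0.99,             # γ
    "clip": 0.2,                         # PPO clipping parameter
    "max_grad_norm": 0.05,
    "episodes_per_br_training": int(2e4),
    "diversity_weight": 0.1,             # λ in the PSD objective
    "policy_network": {
        "type": "MLP",                   # state_dim-256-256-256-action_dim
        "hidden_layers": [256, 256, 256],
        "activation": "ReLU",
    },
}
```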