Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games

Authors: Siqi Liu, Marc Lanctot, Luke Marris, Nicolas Heess

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with Simplex-NeuPL across two domains. First, we study the imperfect-information game of goofspiel... Second, we explore the partially-observed, spatiotemporal strategy game of running-with-scissors...
Researcher Affiliation | Collaboration | 1 University College London, UK; 2 DeepMind, UK. Correspondence to: Siqi Liu <liusiqi@google.com>.
Pseudocode | Yes | Algorithm 1: Simplex Neural Population Learning; Algorithm 2: MGS implementing PSRO-NASH.
Open Source Code | No | The paper mentions and uses the OpenSpiel library, providing its citation and links to specific components within it (e.g., 'open_spiel/python/algorithms/policy_aggregator.py'), but does not explicitly state that the code for the Simplex Neural Population Learning method itself is open-sourced or provide a link to its implementation.
Open Datasets | Yes | The specific implementation of the game is available as part of OpenSpiel (Lanctot et al., 2019), instantiated with the following game string: goofspiel(imp_info=True, egocentric=True, num_cards=5, points_order=descending, returns_type=point_difference) (a minimal loading sketch appears after this table). Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Ewalds, T., Faulkner, R., Kramár, J., Vylder, B. D., Saeta, B., Bradbury, J., Ding, D., Borgeaud, S., Lai, M., Schrittwieser, J., Anthony, T., Hughes, E., Danihelka, I., and Ryan-Davis, J. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019.
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits, as the data is generated dynamically through interaction with game environments rather than being loaded from a static dataset.
Hardware Specification | Yes | Across both domains, we used a single TPU-v2 both to perform gradient updates for the neural population of policies and to serve their inference requests during simulation. The game simulation is then performed on 256 remote CPU actors for running-with-scissors and 128 for goofspiel.
Software Dependencies | No | The paper mentions using an MPO agent and the OpenSpiel framework, but does not provide specific version numbers for the underlying software libraries such as deep learning frameworks (e.g., TensorFlow, PyTorch) or other key dependencies.
Experiment Setup | Yes | We used the same MPO agent (Abdolmaleki et al., 2018) as in Liu et al. (2022), with 20 action samples drawn and evaluated by the learned Q-function per state at each gradient update. The target (Q-value and policy) networks are updated every 100 gradient steps. The policy head and Q-value networks are parameterised by MLPs of (512, 256, 128, NUM_ACTIONS) and (512, 512, 128, 1) respectively, with ELU activation. In goofspiel we invoke the meta-graph solver every 10,000 gradient updates... In running-with-scissors, we update the meta-graph every 1,000 gradient updates. Data are sampled uniformly from the replay server, with a maximum buffer size of 100,000 trajectories across both domains. (A configuration sketch under stated assumptions follows this table.)
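To make the "Open Datasets" row concrete, here is a minimal sketch, assuming a standard OpenSpiel installation with the pyspiel Python bindings, of loading the quoted goofspiel game string and rolling out one uniformly random episode. The game string is quoted from the paper; whether every parameter (notably egocentric) is accepted depends on the installed OpenSpiel version, and none of this is the authors' own code.

```python
import random

import pyspiel  # OpenSpiel Python bindings (Lanctot et al., 2019)

# Game string quoted from the paper; parameter availability (e.g. egocentric)
# may vary with the installed OpenSpiel version.
GAME_STRING = (
    "goofspiel(imp_info=True,egocentric=True,num_cards=5,"
    "points_order=descending,returns_type=point_difference)"
)

game = pyspiel.load_game(GAME_STRING)
state = game.new_initial_state()

# Roll out one uniformly random episode as a sanity check.
while not state.is_terminal():
    if state.is_simultaneous_node():
        # Goofspiel is a simultaneous-move game: all players bid a card at once.
        joint_action = [
            random.choice(state.legal_actions(player))
            for player in range(game.num_players())
        ]
        state.apply_actions(joint_action)
    else:
        state.apply_action(random.choice(state.legal_actions()))

print("Episode returns:", state.returns())
```

Running this prints a per-player returns vector; with returns_type=point_difference the two entries sum to zero, consistent with the symmetric zero-sum setting studied in the paper.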
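The hyper-parameters quoted in the "Experiment Setup" row can be summarised as a small configuration object. This is only an organisational sketch: the class and field names (SimplexNeuPLConfig, num_action_samples, meta_graph_period, and so on) are hypothetical, and only the numeric values are taken from the quoted text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SimplexNeuPLConfig:
    """Hypothetical container for the hyper-parameters quoted above."""

    # MPO policy improvement (Abdolmaleki et al., 2018): action samples
    # drawn and scored by the learned Q-function per state per update.
    num_action_samples: int = 20
    # Gradient steps between syncs of the target Q-value/policy networks.
    target_update_period: int = 100

    # MLP layer widths; the policy head ends in a NUM_ACTIONS-sized layer.
    policy_head_layers: Sequence[int] = (512, 256, 128)
    q_value_layers: Sequence[int] = (512, 512, 128, 1)
    activation: str = "elu"

    # Gradient updates between invocations of the meta-graph solver (MGS).
    meta_graph_period: int = 10_000  # goofspiel value in the quoted text

    # Replay: uniform sampling, capacity measured in trajectories.
    replay_capacity_trajectories: int = 100_000


# The only quoted difference for running-with-scissors is the MGS period.
goofspiel_config = SimplexNeuPLConfig()
rws_config = SimplexNeuPLConfig(meta_graph_period=1_000)
print(goofspiel_config)
print(rws_config)
```

The two instantiations at the end reflect the only per-domain difference the quoted text reports: the meta-graph solver is invoked every 10,000 gradient updates in goofspiel and every 1,000 in running-with-scissors.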