Equilibrium Finding in Normal-Form Games via Greedy Regret Minimization

Authors: Hugh Zhang, Adam Lerer, Noam Brown

AAAI 2022, pp. 9484-9492

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, experiments on large randomly generated games and normal-form subgames of the AI benchmark Diplomacy show that greedy weights outperforms previous methods whenever sampling is used, sometimes by several orders of magnitude.
Researcher Affiliation | Collaboration | Hugh Zhang (Harvard University), Adam Lerer (Facebook AI Research), Noam Brown (Facebook AI Research); hughzhang@fas.harvard.edu, alerer@fb.com, noambrown@fb.com
Pseudocode | Yes | Algorithm 1: Greedy Weights
Open Source Code | Yes | Code to replicate the random normal-form game experiments can be found at https://github.com/hughbzhang/greedy-weights.
Open Datasets | No | The paper describes generating random games and using subgames from Diplomacy, but does not provide specific access information (link, DOI, formal citation) to a publicly available or open dataset for training.
Dataset Splits | No | The paper does not explicitly provide details about dataset splits (training, validation, test) for reproducibility, as it focuses on algorithm convergence in generated game environments.
Hardware Specification | Yes | All experiments on random games (both zero-sum and general-sum) were done on a single CPU core.
Software Dependencies | No | The paper mentions using 'OpenSpiel (Lanctot et al. 2019)' but does not provide specific version numbers for this or any other software dependency.
Experiment Setup | Yes | In particular, we observe in two-player zero-sum games that setting a weight floor of w_sum/(2t) is often useful for speeding up convergence. In all other settings, we did not observe a floor to be beneficial. For internal regret minimization, we primarily use an extension of Blackwell's regret minimization given by Hart and Mas-Colell (2000), also known as regret matching. Regret matching also selects its policy with probability matching its past regrets of not switching to each action, but it differs in that it additionally uses a fixed inertia parameter α and thus always retains a positive probability of staying in place, with probability approaching 1 as the overall regrets vanish. In all of our Diplomacy experiments, each player chooses between the 10 actions that have highest probability in the publicly available policy network from Gray et al. (2021). We use the value network from Gray et al. (2021) to determine payoffs.
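The inertial regret-matching update quoted in the experiment setup can be illustrated with a short sketch. This is a minimal, illustrative rendering of a Hart and Mas-Colell (2000)-style update with a fixed inertia parameter, not the paper's Algorithm 1 (Greedy Weights); the function name `inertial_regret_matching_step`, the `(n, n)` regret-array layout, and the way `alpha` scales the switching probabilities are assumptions made for illustration.

```python
import numpy as np

def inertial_regret_matching_step(internal_regrets, last_action, alpha):
    """One illustrative step of regret matching with a fixed inertia parameter,
    in the spirit of Hart and Mas-Colell (2000) as summarized above.

    internal_regrets: (n, n) array whose [j, k] entry is the cumulative regret
        for not having switched from action j to action k (assumed layout).
    last_action: index of the action played on the previous iteration.
    alpha: fixed inertia parameter; smaller values keep more probability mass
        on last_action.
    """
    # Positive part of the regrets for switching away from the previous action.
    switch_regret = np.maximum(internal_regrets[last_action], 0.0)
    switch_regret[last_action] = 0.0

    probs = alpha * switch_regret
    total_switch = probs.sum()
    if total_switch > 1.0:
        # Keep the distribution valid if alpha is too large for these regrets.
        probs /= total_switch
        total_switch = 1.0

    # Inertia: the leftover mass stays on the previous action; it approaches 1
    # as the regrets vanish.
    probs[last_action] = 1.0 - total_switch
    return probs


# Example: with vanishing regrets, the player almost always stays in place.
regrets = np.array([[0.0, 0.01, 0.0],
                    [0.0, 0.0,  0.0],
                    [0.0, 0.0,  0.0]])
print(inertial_regret_matching_step(regrets, last_action=0, alpha=0.1))
# [0.999 0.001 0.   ]
```

In the example call the regrets are nearly zero, so almost all probability mass stays on the previous action, matching the description above that the stay probability approaches 1 as the overall regrets vanish.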