Adaptively Perturbed Mirror Descent for Learning in Games

Authors: Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Atsushi Iwasaki

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6. Experiments: This section empirically compares the representative instance of MD, namely Multiplicative Weight Update (MWU) and its Optimistic version (OMWU), with our framework. Specifically, we consider the following three instances of APMD: (i) both G and Dψ are the squared ℓ2-distance; (ii) both G and Dψ are the KL divergence, which is also an instance of Reward transformed FTRL in Example 5.2. Note that if the slingshot strategy is fixed to a uniformly random strategy, this algorithm corresponds to Boltzmann Q-Learning in Example 5.1; (iii) the divergence function G is the reverse KL divergence, and the Bregman divergence Dψ is the KL divergence, which matches Mutant FTRL in Example 5.3. We focus on two zero-sum polymatrix games: Three-Player Biased Rock-Paper-Scissors (3BRPS) and three-player random payoff games with 10 and 50 actions. (A sketch of these three divergence choices and their gradients appears after this table.)
Researcher Affiliation | Collaboration | 1CyberAgent, Tokyo, Japan; 2University of Electro-Communications, Tokyo, Japan.
Pseudocode | Yes | Algorithm 1: APMD for player i. (A hedged sketch of an update loop of this shape appears after this table.)
Open Source Code | Yes | An implementation of our method is available at https://github.com/CyberAgentAILab/adaptively-perturbed-md
Open Datasets | No | The paper mentions 'Three-Player Biased Rock-Paper-Scissors (3BRPS)', referring to 'Table 2 in Appendix I' for its payoff matrix, and 'three-player random payoff games' where each entry is 'drawn independently from a uniform distribution on the interval [-1, 1]'. While the generation method is described, no concrete access information (link, DOI, specific repository name, or formal citation with authors and year for a publicly available dataset) is provided for either of these. (A sketch of one way to sample such a game appears after this table.)
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, or testing.
Hardware Specification | Yes | The experiments in Section 6 are conducted in Ubuntu 20.04.2 LTS with Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz and 64GB RAM.
Software Dependencies | No | The paper mentions the operating system 'Ubuntu 20.04.2 LTS' but does not specify any key software components or libraries with their version numbers (e.g., Python, PyTorch, or specific solvers).
Experiment Setup | Yes | Unless otherwise specified, we use a constant learning rate η = 0.1 and a perturbation strength µ = 0.1 for APMD. Further details and additional experiments can be found in Appendix I. For APMD, we set µ = 0.1 and Tσ = 100 for KL and reverse KL divergence perturbation, and set µ = 0.1 and Tσ = 20 for squared ℓ2-distance perturbation. As an exception, η = 0.01, µ = 1.0, and Tσ = 200 are used for APMD with squared ℓ2-distance perturbation in the random payoff games with 50 actions. For the noisy feedback setting, we use the lower learning rate η = 0.01 for all algorithms, except APMD with squared ℓ2-distance perturbation for the random payoff games with 50 actions. We update the slingshot strategy profile σ^k every Tσ = 1000 iterations in APMD with KL and reverse KL divergence perturbation, and update it every Tσ = 200 iterations in APMD with squared ℓ2-distance perturbation. For APMD with squared ℓ2-distance perturbation in the random payoff games with 50 actions, we set η = 0.001 and Tσ = 2000. (These settings are restated as a configuration sketch after this table.)
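
The three APMD instances in the Experiments row differ only in the divergence G used for the perturbation (and in the Bregman divergence Dψ of the mirror step). As a reference, the following is a minimal NumPy sketch of the perturbation gradients ∇G(x, σ) for those three choices, assuming the standard definitions of the squared ℓ2-distance, KL divergence, and reverse KL divergence; the function names are illustrative and not taken from the authors' repository.

```python
import numpy as np

# Hedged sketch: gradients (w.r.t. x) of the perturbation divergence G(x, sigma)
# for the three APMD instances quoted above. Standard definitions are assumed.

def grad_squared_l2(x, sigma):
    # G(x, sigma) = 0.5 * ||x - sigma||_2^2  =>  gradient = x - sigma
    return x - sigma

def grad_kl(x, sigma, eps=1e-12):
    # G(x, sigma) = KL(x || sigma) = sum_a x_a * log(x_a / sigma_a)
    # gradient_a = log(x_a / sigma_a) + 1
    return np.log((x + eps) / (sigma + eps)) + 1.0

def grad_reverse_kl(x, sigma, eps=1e-12):
    # G(x, sigma) = KL(sigma || x) = sum_a sigma_a * log(sigma_a / x_a)
    # gradient_a = -sigma_a / x_a
    return -sigma / (x + eps)
```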
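
Algorithm 1 itself is not reproduced in this summary. The sketch below shows one plausible loop of that shape for a zero-sum polymatrix game, assuming Dψ = KL (so the mirror step reduces to a multiplicative-weights update on the perturbed gradient), exact gradient feedback, and simultaneous updates by all players. It reuses the gradient helpers from the previous sketch, and every name and structural choice here is ours rather than the authors'.

```python
import numpy as np

def apmd_step(x, sigma, grad, grad_G, eta=0.1, mu=0.1):
    # One APMD step with Bregman divergence Dψ = KL (multiplicative-weights form):
    # the new strategy maximizes eta * <grad - mu * ∇G(x, sigma), x'> - KL(x', x).
    g = grad - mu * grad_G(x, sigma)
    logits = np.log(x + 1e-12) + eta * g
    z = np.exp(logits - logits.max())      # stabilized exponentiation
    return z / z.sum()                     # renormalize onto the simplex

def run_apmd(payoffs, n_actions, T=10_000, T_sigma=100, eta=0.1, mu=0.1,
             grad_G=grad_kl):
    # payoffs[i][j]: payoff matrix of player i against player j in the polymatrix game.
    n = len(payoffs)
    x = [np.full(n_actions, 1.0 / n_actions) for _ in range(n)]   # uniform start
    sigma = [xi.copy() for xi in x]                               # initial slingshot
    for t in range(T):
        # Gradient of each player's expected payoff in the polymatrix game.
        grads = [sum(payoffs[i][j] @ x[j] for j in range(n) if j != i)
                 for i in range(n)]
        x = [apmd_step(x[i], sigma[i], grads[i], grad_G, eta, mu) for i in range(n)]
        if (t + 1) % T_sigma == 0:
            sigma = [xi.copy() for xi in x]   # slingshot update every T_sigma steps
    return x
```

Setting µ = 0 removes the perturbation, and the loop degenerates to plain MWU, the baseline the experiments compare against.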
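
The random payoff games are specified only through a generation recipe (each payoff entry drawn independently from a uniform distribution on [-1, 1]); no dataset artifact is released. Below is a small sketch of one way such a three-player instance could be sampled. The pairwise zero-sum convention M[(j, i)] = -M[(i, j)].T is our assumption about how the zero-sum polymatrix structure is realized; it is not stated in the row above.

```python
import numpy as np

def sample_random_polymatrix_game(n_players=3, n_actions=10, seed=0):
    # Each pairwise payoff matrix entry is drawn i.i.d. from Uniform[-1, 1].
    # Enforcing M[(j, i)] = -M[(i, j)].T keeps every pairwise interaction zero-sum,
    # so the whole polymatrix game is zero-sum (assumed convention, see lead-in).
    rng = np.random.default_rng(seed)
    M = {}
    for i in range(n_players):
        for j in range(i + 1, n_players):
            M[(i, j)] = rng.uniform(-1.0, 1.0, size=(n_actions, n_actions))
            M[(j, i)] = -M[(i, j)].T
    return M

# Example wiring into the run_apmd sketch above (diagonal entries are unused):
# M = sample_random_polymatrix_game(n_actions=10)
# payoffs = [[M.get((i, j)) for j in range(3)] for i in range(3)]
# final_strategies = run_apmd(payoffs, n_actions=10)
```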
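
For convenience, the hyperparameters quoted in the Experiment Setup row are collected below as a plain configuration dictionary; the values are copied from the quote, while the grouping and key names are ours.

```python
# Hyperparameters quoted in the Experiment Setup row (grouping and key names are ours).
APMD_SETTINGS = {
    "full_feedback": {
        "default": {"eta": 0.1, "mu": 0.1},
        "kl_perturbation": {"T_sigma": 100},
        "reverse_kl_perturbation": {"T_sigma": 100},
        "squared_l2_perturbation": {"T_sigma": 20},
        "squared_l2_random_payoff_50_actions": {"eta": 0.01, "mu": 1.0, "T_sigma": 200},
    },
    "noisy_feedback": {
        "default": {"eta": 0.01},
        "kl_perturbation": {"T_sigma": 1000},
        "reverse_kl_perturbation": {"T_sigma": 1000},
        "squared_l2_perturbation": {"T_sigma": 200},
        "squared_l2_random_payoff_50_actions": {"eta": 0.001, "T_sigma": 2000},
    },
}
```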