Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality

Authors: Stefanos Leonardos, Georgios Piliouras, Kelly Spendlove

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As showcased by our experiments in network zero-sum games, these theoretical results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.
Researcher Affiliation | Academia | Stefanos Leonardos, Georgios Piliouras (Singapore University of Technology and Design, {stefanos_leonardos;georgios}@sutd.edu.sg); Kelly Spendlove (University of Oxford, spendlove@maths.ox.ac.uk)
Pseudocode | No | The paper describes the Q-learning dynamics and update rules using mathematical equations, but it does not include structured pseudocode or a labeled algorithm block. (A minimal illustrative sketch of such an update rule is given after the table.)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | No | The paper runs its experiments in game environments defined within the paper itself (Asymmetric Matching Pennies, the Match-Mismatch Game) rather than on external, publicly available datasets with access information (URL, DOI, or citation).
Dataset Splits | No | The paper describes simulations within defined game environments rather than experiments on traditional datasets with explicit training, validation, and test splits (e.g., 80/10/10 percentages or sample counts).
Hardware Specification | No | Our simulations concern theoretical abstractions of network games. The total amount of computation is not a concern and all our experiments can be reproduced in any conventional machine.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., particular libraries, frameworks, or solvers and their versions) required to reproduce the experiments.
Experiment Setup | Yes | We plot the exploration path along two representative exploration-exploitation policies: Explore-Then-Exploit (ETE) [5], which starts with (relatively) high exploration that gradually reduces to zero, and Cyclical Learning Rate with 1 cycle (CLR-1) [50], which starts with low exploration, increases to high exploration around the half-life of the cycle and then decays to 0. Summary statistics from 100 runs with 3 profiles of exploration rates in a 7 non-dummy agent instance of (MMG). (An illustrative sketch of the ETE and CLR-1 schedules is given after the table.)
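
As noted in the Pseudocode row, the paper presents its Q-learning dynamics only as equations. The following is a minimal sketch of Boltzmann (softmax) Q-learning with a fixed exploration rate, which is one standard way such dynamics are formulated; it is not the authors' code, and the 2x2 zero-sum payoff matrix, step size, and exploration-rate value below are illustrative assumptions rather than quantities taken from the paper.

    # Minimal sketch (not the authors' released code) of Boltzmann Q-learning
    # with a fixed exploration rate in a stateless 2x2 zero-sum game.
    # Payoffs, step size, and exploration rate are illustrative assumptions.
    import numpy as np

    def boltzmann_policy(q, exploration_rate):
        """Softmax (Boltzmann) choice probabilities for Q-values q."""
        z = q / max(exploration_rate, 1e-12)
        z -= z.max()  # subtract the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def q_update(q, action, reward, step_size):
        """Standard Q-learning update for a repeated normal-form game."""
        q = q.copy()
        q[action] += step_size * (reward - q[action])
        return q

    # Illustrative 2x2 zero-sum game (row player's payoffs); values are assumed.
    payoff_row = np.array([[1.0, -1.0],
                           [-1.0, 1.0]])

    rng = np.random.default_rng(0)
    q_row, q_col = np.zeros(2), np.zeros(2)
    exploration_rate, step_size = 0.5, 0.1

    for _ in range(10_000):
        p_row = boltzmann_policy(q_row, exploration_rate)
        p_col = boltzmann_policy(q_col, exploration_rate)
        a_row = rng.choice(2, p=p_row)
        a_col = rng.choice(2, p=p_col)
        r_row = payoff_row[a_row, a_col]
        q_row = q_update(q_row, a_row, r_row, step_size)
        q_col = q_update(q_col, a_col, -r_row, step_size)  # zero-sum: column player gets -r_row

    print("row policy:", boltzmann_policy(q_row, exploration_rate))
    print("col policy:", boltzmann_policy(q_col, exploration_rate))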
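
The Experiment Setup row quotes the paper's description of the two exploration-rate schedules. Below is a small sketch of how those schedules could be written down, assuming simple linear ramps; the functional forms, rate levels, and horizon are assumptions for illustration and are not taken from the paper.

    # Illustrative exploration-rate schedules (assumed linear forms, not the
    # paper's exact definitions):
    # - ETE: starts high and decays to zero.
    # - CLR-1: starts low, peaks around the half-life of the cycle, decays to 0.

    def ete_schedule(step, total_steps, start_rate=1.0):
        """Explore-Then-Exploit: linear decay from start_rate to 0."""
        return start_rate * max(0.0, 1.0 - step / total_steps)

    def clr1_schedule(step, total_steps, low_rate=0.05, high_rate=1.0):
        """One-cycle schedule: ramp up to high_rate at the half-life, then decay to 0."""
        half = total_steps / 2
        if step <= half:
            return low_rate + (high_rate - low_rate) * (step / half)
        return high_rate * max(0.0, 1.0 - (step - half) / half)

    # Example usage over a 1000-step horizon (horizon chosen for illustration).
    total = 1000
    print([round(ete_schedule(t, total), 2) for t in (0, 250, 500, 750, 1000)])
    print([round(clr1_schedule(t, total), 2) for t in (0, 250, 500, 750, 1000)])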