Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games

Authors: Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS: CONGESTION GAMES"; "Results. The left panel of Figure 5 shows that the agents learn the expected Nash profile in both states in all runs."; "We implemented this environment with N = 4 agents... We used our implementation of the independent policy gradient algorithm with the same parameters as in our experiment from Section 5, specifically we have T = 20, γ = 0.99, and η = 0.0001. The results are shown in Figure 10."
Researcher Affiliation | Academia | Stefanos Leonardos, Singapore University of Technology and Design, stefanos_leonardos@sutd.edu.sg; William Overman, University of California, Irvine, overmana@uci.edu; Ioannis Panageas, University of California, Irvine, ipanagea@ics.uci.edu; Georgios Piliouras, Singapore University of Technology and Design, georgios@sutd.edu.sg
Pseudocode | No | "The PGA algorithm is given by π_i^(t+1) := P_{Δ(A_i)^S}(π_i^(t) + η ∇_{π_i} V_i^ρ(π^(t))) (PGA)"; "(PSGA) is given by π_i^(t+1) := P_{Δ(A_i)^S}(π_i^(t) + η ∇̂_{π_i}^(t)) (PSGA)". A minimal code sketch of this projected update follows the table.
Open Source Code | Yes | "We also uploaded the code that was used to run the experiments (policy gradient algorithm) as supplementary material."
Open Datasets | No | "We consider an experiment (Figure 4) with N = 8 agents, A_i = 4 facilities (resources or locations) that the agents can select from and S = 2 states: a safe state and a distancing state." A toy sketch of such a simulated environment also follows the table.
Dataset Splits | No | No information about training/validation/test dataset splits is provided, as the paper conducts experiments in a simulated environment rather than on a static dataset.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instances) are mentioned for the experiments.
Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper.
Experiment Setup | Yes | "We perform episodic updates with T = 20 steps. At each iteration, we estimate the policy gradients using the average of mini-batches of size 20. We use γ = 0.99 and a common learning rate η = 0.0001 (larger than the theoretical guarantee, η = (1 − γ)^3 / (2γ A_max n) ≈ 1e-08, of Theorem 4.2)."
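
Since the paper states the update rule only as an equation (see the Pseudocode row), here is a minimal NumPy sketch of that projected gradient ascent step under the direct policy parameterization. The helper names `project_simplex` and `pga_update` are my own; this is an illustrative sketch, not the authors' released code.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]                     # sort in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def pga_update(policy, grad, eta):
    """One projected gradient ascent step for a single agent.

    policy: (num_states, num_actions) array; each row is a distribution over A_i.
    grad:   gradient (or stochastic estimate) of V_i^rho w.r.t. the policy, same shape.
    eta:    learning rate.
    """
    stepped = policy + eta * grad
    # Projection onto Delta(A_i)^S is a row-wise projection onto the simplex.
    return np.apply_along_axis(project_simplex, 1, stepped)
```

Passing a stochastic mini-batch estimate in place of the exact gradient turns the same call into the PSGA update.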
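
The experiments use a simulated two-state congestion game rather than a static dataset (see the Open Datasets row). The class below is a toy stand-in for that environment: the facility weights, the congestion penalty in the distancing state, and the crowding rule that switches states are placeholder assumptions of mine, not the paper's exact specification.

```python
import numpy as np

class TwoStateCongestionGame:
    """Toy two-state congestion game in the spirit of the paper's experiments.

    N agents each choose one of K facilities. In the safe state, agents are rewarded
    for gathering at (weighted) facilities; in the distancing state, crowded facilities
    are penalized. The weights, penalty, and state-switching rule below are
    illustrative placeholders, not the paper's exact specification.
    """

    SAFE, DISTANCING = 0, 1

    def __init__(self, n_agents=8, n_facilities=4, crowd_threshold=4):
        self.n_agents = n_agents
        self.n_facilities = n_facilities
        self.crowd_threshold = crowd_threshold          # placeholder switching rule
        self.weights = np.arange(1, n_facilities + 1)   # placeholder facility weights
        self.state = self.SAFE

    def reset(self):
        self.state = self.SAFE
        return self.state

    def step(self, actions):
        counts = np.bincount(actions, minlength=self.n_facilities)
        sign = 1.0 if self.state == self.SAFE else -1.0  # crowding helps in safe, hurts in distancing
        rewards = np.array([sign * self.weights[a] * counts[a] for a in actions], dtype=float)
        # Placeholder transition: heavy crowding triggers distancing, spreading out restores safety.
        self.state = self.DISTANCING if counts.max() >= self.crowd_threshold else self.SAFE
        return self.state, rewards
```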
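
Finally, the Experiment Setup row quotes the hyperparameters (T = 20, mini-batches of size 20, γ = 0.99, η = 0.0001) but not the estimator itself. The sketch below shows one plausible way to wire an episodic REINFORCE-style mini-batch estimate into independent projected updates; the estimator form and the `estimate_gradient`/`train` helpers are assumptions on my part, and the code reuses `pga_update` and `TwoStateCongestionGame` from the sketches above.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row.
T, GAMMA, ETA, BATCH = 20, 0.99, 1e-4, 20

def estimate_gradient(env, policies, i, rng):
    """REINFORCE-style mini-batch estimate of agent i's policy gradient (assumed form)."""
    grad = np.zeros_like(policies[i])
    for _ in range(BATCH):
        s = env.reset()
        episode = []  # (state, agent i's action, agent i's reward) per step
        for _ in range(T):
            acts = [rng.choice(p.shape[1], p=p[s]) for p in policies]
            s_next, rewards = env.step(acts)
            episode.append((s, acts[i], rewards[i]))
            s = s_next
        for t, (s_t, a_t, _) in enumerate(episode):
            ret = sum(GAMMA ** (k - t) * episode[k][2] for k in range(t, T))
            # Direct parameterization: d log pi(a|s) / d pi(a|s) = 1 / pi(a|s).
            grad[s_t, a_t] += GAMMA ** t * ret / max(policies[i][s_t, a_t], 1e-12)
    return grad / BATCH

def train(env, n_agents, n_states, n_actions, iters=1000, seed=0):
    """Independent projected stochastic gradient ascent for all agents."""
    # Reuses pga_update (and, for env, TwoStateCongestionGame) from the sketches above.
    rng = np.random.default_rng(seed)
    policies = [np.full((n_states, n_actions), 1.0 / n_actions) for _ in range(n_agents)]
    for _ in range(iters):
        grads = [estimate_gradient(env, policies, i, rng) for i in range(n_agents)]
        policies = [pga_update(policies[i], grads[i], ETA) for i in range(n_agents)]
    return policies
```

For example, `train(TwoStateCongestionGame(), n_agents=8, n_states=2, n_actions=4)` runs the toy analogue of the paper's N = 8, four-facility experiment.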