Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games

Authors: Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS: CONGESTION GAMES"; "Results. The left panel of Figure 5 shows that the agents learn the expected Nash profile in both states in all runs."; "We implemented this environment with N = 4 agents... We used our implementation of the independent policy gradient algorithm with the same parameters as in our experiment from Section 5, specifically we have T = 20, γ = 0.99, and η = 0.0001. The results are shown in Figure 10."
Researcher Affiliation | Academia | Stefanos Leonardos, Singapore University of Technology and Design, stefanos_leonardos@sutd.edu.sg; William Overman, University of California, Irvine, overmana@uci.edu; Ioannis Panageas, University of California, Irvine, ipanagea@ics.uci.edu; Georgios Piliouras, Singapore University of Technology and Design, georgios@sutd.edu.sg
Pseudocode | No | "The PGA algorithm is given by π_i^(t+1) := P_{Δ(A_i)^S}(π_i^(t) + η ∇_{π_i} V_i^ρ(π^(t))) (PGA)"; "(PSGA) is given by π_i^(t+1) := P_{Δ(A_i)^S}(π_i^(t) + η ∇̂_{π_i}^(t)) (PSGA)". A minimal code sketch of this projected update follows the table.
Open Source Code | Yes | "We also uploaded the code that was used to run the experiments (policy gradient algorithm) as supplementary material."
Open Datasets | No | "We consider an experiment (Figure 4) with N = 8 agents, A_i = 4 facilities (resources or locations) that the agents can select from and S = 2 states: a safe state and a distancing state." A toy sketch of such a simulated environment also follows the table.
Dataset Splits | No | No information about training/validation/test dataset splits is provided, as the paper conducts experiments in a simulated environment rather than on a static dataset.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instances) are mentioned for the experiments.
Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper.
Experiment Setup | Yes | "We perform episodic updates with T = 20 steps. At each iteration, we estimate the policy gradients using the average of mini-batches of size 20. We use γ = 0.99 and a common learning rate η = 0.0001 (larger than the theoretical guarantee, η = (1 − γ)^3 / (2γ A_max n) ≈ 1e-08, of Theorem 4.2)."
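
Since the paper states the update rule only as an equation (see the Pseudocode row), here is a minimal NumPy sketch of that projected gradient ascent step under the direct policy parameterization. The helper names `project_simplex` and `pga_update` are my own; this is an illustrative sketch, not the authors' released code.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]                     # sort in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def pga_update(policy, grad, eta):
    """One projected gradient ascent step for a single agent.

    policy: (num_states, num_actions) array; each row is a distribution over A_i.
    grad:   gradient (or stochastic estimate) of V_i^rho w.r.t. the policy, same shape.
    eta:    learning rate.
    """
    stepped = policy + eta * grad
    # Projection onto Delta(A_i)^S is a row-wise projection onto the simplex.
    return np.apply_along_axis(project_simplex, 1, stepped)
```

Passing a stochastic mini-batch estimate in place of the exact gradient turns the same call into the PSGA update.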
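
The experiments use a simulated two-state congestion game rather than a static dataset (see the Open Datasets row). The class below is a toy stand-in for that environment: the facility weights, the congestion penalty in the distancing state, and the crowding rule that switches states are placeholder assumptions of mine, not the paper's exact specification.

```python
import numpy as np

class TwoStateCongestionGame:
    """Toy two-state congestion game in the spirit of the paper's experiments.

    N agents each choose one of K facilities. In the safe state, agents are rewarded
    for gathering at (weighted) facilities; in the distancing state, crowded facilities
    are penalized. The weights, penalty, and state-switching rule below are
    illustrative placeholders, not the paper's exact specification.
    """

    SAFE, DISTANCING = 0, 1

    def __init__(self, n_agents=8, n_facilities=4, crowd_threshold=4):
        self.n_agents = n_agents
        self.n_facilities = n_facilities
        self.crowd_threshold = crowd_threshold          # placeholder switching rule
        self.weights = np.arange(1, n_facilities + 1)   # placeholder facility weights
        self.state = self.SAFE

    def reset(self):
        self.state = self.SAFE
        return self.state

    def step(self, actions):
        counts = np.bincount(actions, minlength=self.n_facilities)
        sign = 1.0 if self.state == self.SAFE else -1.0  # crowding helps in safe, hurts in distancing
        rewards = np.array([sign * self.weights[a] * counts[a] for a in actions], dtype=float)
        # Placeholder transition: heavy crowding triggers distancing, spreading out restores safety.
        self.state = self.DISTANCING if counts.max() >= self.crowd_threshold else self.SAFE
        return self.state, rewards
```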
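
Finally, the Experiment Setup row quotes the hyperparameters (T = 20, mini-batches of size 20, γ = 0.99, η = 0.0001) but not the estimator itself. The sketch below shows one plausible way to wire an episodic REINFORCE-style mini-batch estimate into independent projected updates; the estimator form and the `estimate_gradient`/`train` helpers are assumptions on my part, and the code reuses `pga_update` and `TwoStateCongestionGame` from the sketches above.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row.
T, GAMMA, ETA, BATCH = 20, 0.99, 1e-4, 20

def estimate_gradient(env, policies, i, rng):
    """REINFORCE-style mini-batch estimate of agent i's policy gradient (assumed form)."""
    grad = np.zeros_like(policies[i])
    for _ in range(BATCH):
        s = env.reset()
        episode = []  # (state, agent i's action, agent i's reward) per step
        for _ in range(T):
            acts = [rng.choice(p.shape[1], p=p[s]) for p in policies]
            s_next, rewards = env.step(acts)
            episode.append((s, acts[i], rewards[i]))
            s = s_next
        for t, (s_t, a_t, _) in enumerate(episode):
            ret = sum(GAMMA ** (k - t) * episode[k][2] for k in range(t, T))
            # Direct parameterization: d log pi(a|s) / d pi(a|s) = 1 / pi(a|s).
            grad[s_t, a_t] += GAMMA ** t * ret / max(policies[i][s_t, a_t], 1e-12)
    return grad / BATCH

def train(env, n_agents, n_states, n_actions, iters=1000, seed=0):
    """Independent projected stochastic gradient ascent for all agents."""
    # Reuses pga_update (and, for env, TwoStateCongestionGame) from the sketches above.
    rng = np.random.default_rng(seed)
    policies = [np.full((n_states, n_actions), 1.0 / n_actions) for _ in range(n_agents)]
    for _ in range(iters):
        grads = [estimate_gradient(env, policies, i, rng) for i in range(n_agents)]
        policies = [pga_update(policies[i], grads[i], ETA) for i in range(n_agents)]
    return policies
```

For example, `train(TwoStateCongestionGame(), n_agents=8, n_states=2, n_actions=4)` runs the toy analogue of the paper's N = 8, four-facility experiment.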