Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games
Authors: Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (EXPERIMENTS: CONGESTION GAMES): "Results. The left panel of Figure 5 shows that the agents learn the expected Nash profile in both states in all runs."; "We implemented this environment with N = 4 agents... We used our implementation of the independent policy gradient algorithm with the same parameters as in our experiment from Section 5, specifically we have T = 20, γ = 0.99, and η = 0.0001. The results are shown in Figure 10." |
| Researcher Affiliation | Academia | Stefanos Leonardos (Singapore University of Technology and Design, stefanos_leonardos@sutd.edu.sg); William Overman (University of California, Irvine, overmana@uci.edu); Ioannis Panageas (University of California, Irvine, ipanagea@ics.uci.edu); Georgios Piliouras (Singapore University of Technology and Design, georgios@sutd.edu.sg) |
| Pseudocode | No | The PGA algorithm is given by $\pi_i^{(t+1)} := P_{\Delta(\mathcal{A}_i)^S}\!\left(\pi_i^{(t)} + \eta \nabla_{\pi_i} V_i^{\rho}(\pi^{(t)})\right)$ (PGA); the PSGA update is given by $\pi_i^{(t+1)} := P_{\Delta(\mathcal{A}_i)^S}\!\left(\pi_i^{(t)} + \eta \hat{\nabla}_{\pi_i}^{(t)}\right)$ (PSGA). No pseudocode listing is provided. (A code sketch of this projected update appears after the table.) |
| Open Source Code | Yes | We also uploaded the code that was used to run the experiments (policy gradient algorithm) as supplementary material. |
| Open Datasets | No | We consider an experiment (Figure 4) with N = 8 agents, A_i = 4 facilities (resources or locations) that the agents can select from, and S = 2 states: a safe state and a distancing state. |
| Dataset Splits | No | No information about training/validation/test dataset splits is provided, as the paper conducts experiments in a simulated environment rather than on a static dataset. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instances) are mentioned for the experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | We perform episodic updates with T = 20 steps. At each iteration, we estimate the policy gradients using the average of mini-batches of size 20. We use γ = 0.99 and a common learning rate η = 0.0001 (larger than the theoretical guarantee of Theorem 4.2, $\eta = \frac{(1-\gamma)^3}{2\gamma A_{\max} n} \approx 10^{-8}$). (See the training-loop sketch after the table.) |
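
The PGA/PSGA update quoted in the Pseudocode row is a projected gradient step onto the product of per-state probability simplices. Below is a minimal sketch of one such step for a single agent with a tabular policy; the NumPy representation, the Euclidean simplex projection, and the function names are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def pga_update(policy, grad, eta=1e-4):
    """One projected policy gradient ascent step for agent i.

    policy: array of shape (S, A_i), one action distribution per state.
    grad:   array of shape (S, A_i), (estimated) gradient of V_i^rho
            with respect to the policy parameters.
    """
    updated = policy + eta * grad
    # Project each state's row back onto the simplex Delta(A_i).
    return np.vstack([project_simplex(row) for row in updated])
```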
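
Similarly, the reported setup (episodes of T = 20 steps, mini-batches of 20 trajectories, γ = 0.99, common learning rate η = 0.0001) could be wired into an independent-learning loop roughly as sketched below. Here `env`, `policies`, and `estimate_gradient` are hypothetical placeholders, and the loop reuses `pga_update` from the previous sketch; this is an assumed outline, not the paper's supplementary code.

```python
# Hyperparameters as reported in the Experiment Setup row.
T = 20            # episode length
GAMMA = 0.99      # discount factor
ETA = 1e-4        # common learning rate
MINI_BATCH = 20   # trajectories averaged per gradient estimate

def train(env, policies, n_iterations, estimate_gradient):
    """Independent projected stochastic policy gradient (PSGA-style) loop.

    `policies` holds one tabular policy of shape (S, A_i) per agent, and
    `estimate_gradient` is a REINFORCE-style estimator that averages
    MINI_BATCH trajectories of length T.
    """
    for _ in range(n_iterations):
        grads = [estimate_gradient(env, policies, i, T, GAMMA, MINI_BATCH)
                 for i in range(len(policies))]
        # All agents update simultaneously and independently.
        policies = [pga_update(pi, g, ETA) for pi, g in zip(policies, grads)]
    return policies
```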