Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games
Authors: Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS: CONGESTION GAMES; Results. The left panel of Figure 5 shows that the agents learn the expected Nash profile in both states in all runs.; We implemented this environment with N = 4 agents... We used our implementation of the independent policy gradient algorithm with the same parameters as in our experiment from Section 5, specifically we have T = 20, γ = 0.99, and η = 0.0001. The results are shown in Figure 10. |
| Researcher Affiliation | Academia | Stefanos Leonardos (Singapore University of Technology and Design); William Overman (University of California, Irvine); Ioannis Panageas (University of California, Irvine); Georgios Piliouras (Singapore University of Technology and Design) |
| Pseudocode | No | The PGA algorithm is given by π_i^(t+1) := P_{Δ(A_i)^S}[π_i^(t) + η ∇_{π_i} V^i_ρ(π^(t))] (PGA); its stochastic variant is given by π_i^(t+1) := P_{Δ(A_i)^S}[π_i^(t) + η ∇̂^(t)_{π_i}] (PSGA). |
| Open Source Code | Yes | We also uploaded the code that was used to run the experiments (policy gradient algorithm) as supplementary material. |
| Open Datasets | No | We consider an experiment (Figure 4) with N = 8 agents, Ai = 4 facilities (resources or locations) that the agents can select from and S = 2 states: a safe state and a distancing state. |
| Dataset Splits | No | No information about training/validation/test dataset splits is provided, as the paper conducts experiments in a simulated environment rather than on a static dataset. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instances) are mentioned for the experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | We perform episodic updates with T = 20 steps. At each iteration, we estimate the policy gradients using the average of mini-batches of size 20. We use γ = 0.99 and a common learning rate η = 0.0001 (larger than the theoretical guarantee, η = (1 − γ)³ / (2γ A_max n) ≈ 1e−08, of Theorem 4.2). |
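The PGA update quoted in the Pseudocode row is a projected gradient ascent step: each agent moves its per-state policy in the direction of its value gradient, then projects back onto the probability simplex. A minimal sketch of that update, assuming NumPy and a standard Euclidean simplex projection (the function names `project_simplex` and `pga_step` are illustrative, not from the paper; the authors' own implementation is in their supplementary material):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex.

    Uses the standard sort-and-threshold algorithm: find the largest
    number of coordinates that can stay positive after a uniform shift
    that makes the result sum to 1, then clip the rest at zero.
    """
    u = np.sort(v)[::-1]                       # coordinates in decreasing order
    css = np.cumsum(u)
    # largest index j with u_j + (1 - sum_{k<=j} u_k) / (j+1) > 0
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)       # uniform shift
    return np.maximum(v + theta, 0.0)

def pga_step(policy, grad, eta=1e-4):
    """One projected gradient ascent update for a single agent:

        pi^(t+1) = Proj_{Delta(A_i)^S} [ pi^(t) + eta * grad ]

    `policy` and `grad` are arrays of shape (num_states, num_actions);
    the projection is applied per state, row by row.
    """
    updated = policy + eta * grad
    return np.array([project_simplex(row) for row in updated])
```

In the stochastic variant (PSGA), `grad` would be replaced by a mini-batch estimate of the policy gradient; with η = 0.0001, as in the paper's experiments, each step stays close to the simplex interior.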