Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning
Authors: Songtao Lu, Kaiqing Zhang, Tianyi Chen, Tamer Başar, Lior Horesh (pp. 8767–8775)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Convergence guarantees, together with numerical results, showcase the superiority of the proposed algorithm over classic decentralized methods without safety considerations. To demonstrate the performance of safe decentralized RL, the algorithm is tested on the Cooperative Navigation task in (Lowe et al. 2017), built on the popular OpenAI Gym framework (Brockman et al. 2016). The experiments were run on an NVIDIA Tesla V100 GPU with 32 GB of memory. In the first experiment, n = 5 agents aim at finding their own landmarks, and all agents are connected by a well-connected graph as shown in Figure 1(a). Figure 1(b) shows that the averaged network constrained rewards obtained by Safe Dec-PG are much higher than those achieved by DSGT, and that Safe Dec-PG also converges faster than DSGT. |
| Researcher Affiliation | Collaboration | (1) IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA; (2) University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA; (3) Rensselaer Polytechnic Institute, Troy, New York 12144, USA. Emails: songtao@ibm.com, kzhang66@illinois.edu, chent18@rpi.edu, basar1@illinois.edu, lhoresh@us.ibm.com |
| Pseudocode | Yes | Algorithm 1 (Safe Dec-PG). Input: θ_i^0, ϑ_i^0 = λ_i^0 = 0, ∀i. For r = 1, 2, …: for each agent i: update θ_i^{r+1} by (10); perform rollouts to obtain the estimate ∇̂_{θ_i}^{T,K} f_i(θ_i^r, λ_i^r); update ϑ_i^{r+1} by (11); compute (Ĵ_i^C)^{T,K}(θ_i^{r+1}); update λ_i^{r+1} by (13); end for; end for. (A structural sketch in Python appears after the table.) |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of open-source code for the methodology described. |
| Open Datasets | No | The paper does not provide concrete access information (link, DOI, repository, or formal citation with authors/year) for a publicly available or open dataset. It mentions 'Cooperative Navigation task in (Lowe et al. 2017)' and 'Open AI Gym paradigm (Brockman et al. 2016)' but without specific access details for the dataset used for their experiments. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes the environment and the number of agents but not the data partitioning. For example, it mentions 'T = 20' for horizon approximation and 'K = 10 Monte Carlo trials' for PG estimation, but these are related to simulation parameters, not dataset splits. |
| Hardware Specification | Yes | The experiments were run on an NVIDIA Tesla V100 GPU with 32 GB of memory. |
| Software Dependencies | No | The paper mentions 'neural network' and 'Open AI Gym paradigm' but does not specify software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | Parameters: the policy at each agent is parametrized by a neural network with two hidden layers (30 neurons in the first layer, 10 in the second). The states of each agent include its position and velocity, so the input layer has dimension 20 and the output layer has dimension 5. The discount factor γ in the cumulative loss is 0.99 in all tests, and for each episode the horizon approximation of the PG uses T = 20. Also, K = 10 Monte Carlo trials are run independently to compute the approximate PG at each iteration. The initial stepsizes of Safe Dec-PG and DSGT are both 0.1, and c_i = 0.8 for all i. (See the sketches after the table for an illustrative reconstruction.) |
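
For readers reconstructing the pipeline, here is a minimal Python (NumPy) sketch of the loop structure in Algorithm 1 (Safe Dec-PG), not the paper's exact updates: the stand-in estimators `grad_lagrangian` and `constraint_value`, the toy parameter dimension, and the complete-graph mixing matrix `W` are assumptions, while the true update rules (10), (11), (13) and the T-horizon, K-trial rollout estimators are defined in the paper.

```python
import numpy as np

# Hypothetical sketch of the per-agent structure of Safe Dec-PG:
# a consensus + gradient-tracking primal step (theta, vartheta) and a
# projected dual-ascent step (lambda). All estimators below are toy stand-ins.

n_agents = 5            # number of agents (matches the first experiment)
dim = 8                 # toy parameter dimension (illustrative only)
alpha, beta = 0.1, 0.1  # primal/dual stepsizes (0.1 is the reported initial stepsize)

rng = np.random.default_rng(0)
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # doubly stochastic mixing matrix (toy: complete graph)

theta = rng.normal(size=(n_agents, dim))   # policy parameters, one row per agent
lam = np.zeros(n_agents)                   # dual variables (one constraint per agent here)

def grad_lagrangian(th, lm):
    """Toy stand-in for the T-horizon, K-trial policy-gradient estimate of f_i."""
    return th + lm  # placeholder; in the paper this comes from rollouts

def constraint_value(th):
    """Toy stand-in for the estimated constrained return (J_i^C)."""
    return float(np.sum(th ** 2) - 1.0)  # placeholder safety-style constraint

grads = np.stack([grad_lagrangian(theta[i], lam[i]) for i in range(n_agents)])
vartheta = grads.copy()  # initialize the gradient tracker with the first estimate

for r in range(200):
    # Primal step (cf. (10)): mix with neighbors, then descend along the tracked gradient.
    theta = W @ theta - alpha * vartheta
    new_grads = np.stack([grad_lagrangian(theta[i], lam[i]) for i in range(n_agents)])
    # Gradient-tracking step (cf. (11)): mix trackers and add the gradient difference.
    vartheta = W @ vartheta + new_grads - grads
    grads = new_grads
    # Dual step (cf. (13)): projected ascent on the constraint estimate.
    lam = np.maximum(0.0, lam + beta * np.array([constraint_value(theta[i]) for i in range(n_agents)]))
```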
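
The policy architecture and simulation hyperparameters reported in the Experiment Setup row can likewise be sketched. Layer sizes, γ, T, K, the initial stepsize, and c_i follow the paper; the Tanh activations, the categorical action head, and the use of PyTorch are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class AgentPolicy(nn.Module):
    """Per-agent policy: 20-d observation -> two hidden layers (30, 10) -> 5 action logits."""

    def __init__(self, obs_dim: int = 20, act_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 30), nn.Tanh(),  # first hidden layer: 30 neurons (activation assumed)
            nn.Linear(30, 10), nn.Tanh(),       # second hidden layer: 10 neurons (activation assumed)
            nn.Linear(10, act_dim),             # output layer: 5 action logits (categorical head assumed)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

# Reported simulation hyperparameters (these are rollout settings, not dataset splits):
GAMMA = 0.99   # discount factor
T = 20         # horizon used to approximate the policy gradient per episode
K = 10         # independent Monte Carlo rollouts per gradient estimate
LR_INIT = 0.1  # initial stepsize for both Safe Dec-PG and DSGT
C_I = 0.8      # c_i for all agents
```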