Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning
Authors: Songtao Lu, Kaiqing Zhang, Tianyi Chen, Tamer Başar, Lior Horesh (pp. 8767–8775)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Convergence guarantees, together with numerical results, showcase the superiority of the proposed algorithm over classic decentralized methods without safety considerations. To demonstrate the performance of safe decentralized RL, the algorithm is tested on the Cooperative Navigation task in (Lowe et al. 2017), built on the popular OpenAI Gym framework (Brockman et al. 2016). The experiments were run on an NVIDIA Tesla V100 GPU with 32 GB of memory. In the first experiment, n = 5 agents aim at finding their own landmarks, and all agents are connected by a well-connected graph as shown in Figure 1(a). Figure 1(b) shows that the averaged network constrained rewards obtained by Safe Dec-PG are much higher than those achieved by DSGT, and that Safe Dec-PG also converges faster than DSGT. |
| Researcher Affiliation | Collaboration | (1) IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA; (2) University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA; (3) Rensselaer Polytechnic Institute, Troy, New York 12144, USA. Emails: songtao@ibm.com, kzhang66@illinois.edu, chent18@rpi.edu, basar1@illinois.edu, lhoresh@us.ibm.com |
| Pseudocode | Yes | Algorithm 1 (Safe Dec-PG). Input: θ_i^0, ϑ_i^0 = λ_i^0 = 0, ∀i. For r = 1, 2, …: for each agent i: update θ_i^{r+1} by (10); perform rollouts to obtain the estimate ∇̂_{θ_i}^{T,K} f_i(θ_i^r, λ_i^r); update ϑ_i^{r+1} by (11); compute (Ĵ_i^C)^{T,K}(θ_i^{r+1}); update λ_i^{r+1} by (13); end for; end for. (A structural sketch in Python appears after the table.) |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of open-source code for the methodology described. |
| Open Datasets | No | The paper does not provide concrete access information (link, DOI, repository, or formal citation with authors/year) for a publicly available or open dataset. It mentions 'Cooperative Navigation task in (Lowe et al. 2017)' and 'Open AI Gym paradigm (Brockman et al. 2016)' but without specific access details for the dataset used for their experiments. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes the environment and the number of agents but not the data partitioning. For example, it mentions 'T = 20' for horizon approximation and 'K = 10 Monte Carlo trials' for PG estimation, but these are related to simulation parameters, not dataset splits. |
| Hardware Specification | Yes | The experiments were run on an NVIDIA Tesla V100 GPU with 32 GB of memory. |
| Software Dependencies | No | The paper mentions 'neural network' and 'Open AI Gym paradigm' but does not specify software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | Parameters: the policy at each agent is parametrized by a neural network with two hidden layers (30 neurons in the first layer, 10 in the second). The states of each agent include its position and velocity, so the input layer has dimension 20 and the output layer has dimension 5. The discount factor γ in the cumulative loss is 0.99 in all tests, and for each episode the horizon approximation of the PG uses T = 20. Also, K = 10 Monte Carlo trials are run independently to compute the approximate PG at each iteration. The initial stepsizes of Safe Dec-PG and DSGT are both 0.1, and c_i = 0.8 for all i. (See the sketches after the table for an illustrative reconstruction.) |
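
For readers reconstructing the pipeline, here is a minimal Python (NumPy) sketch of the loop structure in Algorithm 1 (Safe Dec-PG), not the paper's exact updates: the stand-in estimators `grad_lagrangian` and `constraint_value`, the toy parameter dimension, and the complete-graph mixing matrix `W` are assumptions, while the true update rules (10), (11), (13) and the T-horizon, K-trial rollout estimators are defined in the paper.

```python
import numpy as np

# Hypothetical sketch of the per-agent structure of Safe Dec-PG:
# a consensus + gradient-tracking primal step (theta, vartheta) and a
# projected dual-ascent step (lambda). All estimators below are toy stand-ins.

n_agents = 5            # number of agents (matches the first experiment)
dim = 8                 # toy parameter dimension (illustrative only)
alpha, beta = 0.1, 0.1  # primal/dual stepsizes (0.1 is the reported initial stepsize)

rng = np.random.default_rng(0)
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # doubly stochastic mixing matrix (toy: complete graph)

theta = rng.normal(size=(n_agents, dim))   # policy parameters, one row per agent
lam = np.zeros(n_agents)                   # dual variables (one constraint per agent here)

def grad_lagrangian(th, lm):
    """Toy stand-in for the T-horizon, K-trial policy-gradient estimate of f_i."""
    return th + lm  # placeholder; in the paper this comes from rollouts

def constraint_value(th):
    """Toy stand-in for the estimated constrained return (J_i^C)."""
    return float(np.sum(th ** 2) - 1.0)  # placeholder safety-style constraint

grads = np.stack([grad_lagrangian(theta[i], lam[i]) for i in range(n_agents)])
vartheta = grads.copy()  # initialize the gradient tracker with the first estimate

for r in range(200):
    # Primal step (cf. (10)): mix with neighbors, then descend along the tracked gradient.
    theta = W @ theta - alpha * vartheta
    new_grads = np.stack([grad_lagrangian(theta[i], lam[i]) for i in range(n_agents)])
    # Gradient-tracking step (cf. (11)): mix trackers and add the gradient difference.
    vartheta = W @ vartheta + new_grads - grads
    grads = new_grads
    # Dual step (cf. (13)): projected ascent on the constraint estimate.
    lam = np.maximum(0.0, lam + beta * np.array([constraint_value(theta[i]) for i in range(n_agents)]))
```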
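
The policy architecture and simulation hyperparameters reported in the Experiment Setup row can likewise be sketched. Layer sizes, γ, T, K, the initial stepsize, and c_i follow the paper; the Tanh activations, the categorical action head, and the use of PyTorch are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class AgentPolicy(nn.Module):
    """Per-agent policy: 20-d observation -> two hidden layers (30, 10) -> 5 action logits."""

    def __init__(self, obs_dim: int = 20, act_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 30), nn.Tanh(),  # first hidden layer: 30 neurons (activation assumed)
            nn.Linear(30, 10), nn.Tanh(),       # second hidden layer: 10 neurons (activation assumed)
            nn.Linear(10, act_dim),             # output layer: 5 action logits (categorical head assumed)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

# Reported simulation hyperparameters (these are rollout settings, not dataset splits):
GAMMA = 0.99   # discount factor
T = 20         # horizon used to approximate the policy gradient per episode
K = 10         # independent Monte Carlo rollouts per gradient estimate
LR_INIT = 0.1  # initial stepsize for both Safe Dec-PG and DSGT
C_I = 0.8      # c_i for all agents
```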