Self-Organized Polynomial-Time Coordination Graphs
Authors: Qianlan Yang, Weijun Dong, Zhizhou Ren, Jianhao Wang, Tonghan Wang, Chongjie Zhang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that our approach learns succinct and well-adapted graph topologies, induces effective coordination, and improves performance across a variety of cooperative multi-agent tasks. In this section, we conduct experiments to answer the following questions: (1) Does the accurate greedy action selection improve our performance (see Section 5.1 and 5.2)? (2) Is the dynamic graph organization mechanism necessary for our algorithm (see Section 5.1 and 5.3)? (3) How well does SOP-CG perform on complex cooperative multi-agent tasks (see Fig. 4, 5 and 6)? (4) Can SOP-CG extract interpretable dynamic coordination structures in complex scenarios (see Section 5.4)? |
| Researcher Affiliation | Academia | ¹Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; ²Department of Computer Science, University of Illinois at Urbana-Champaign; ³Harvard University. |
| Pseudocode | Yes | Algorithm 1 Self-Organized Polynomial-Time Coordination Graphs |
| Open Source Code | Yes | An open-source implementation of our algorithm is available online at https://github.com/yanQval/SOP-CG. |
| Open Datasets | Yes | In experiments, we evaluate SOP-CG on the MACO benchmark (Wang et al., 2021b), particle environment (Lowe et al., 2017) and StarCraft II (Samvelyan et al., 2019). |
| Dataset Splits | No | The paper mentions using a replay buffer and evaluating on 'test return' but does not provide explicit details about training, validation, or test dataset splits. |
| Hardware Specification | Yes | The experiments are finished on NVIDIA RTX 2080TI GPU. |
| Software Dependencies | No | The paper mentions software components like GRU, RMSProp, and PyMARL but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | All tasks in this paper use a discount factor γ = 0.99. We use ϵ-greedy exploration, and ϵ anneals linearly from 1.0 to 0.05 over 50000 time-steps. We use an RMSProp optimizer with a learning rate of 5e-3 to train our network. A first-in-first-out (FIFO) replay buffer stores the experiences of at most 5000 episodes, and a batch of 32 episodes are sampled from the buffer during the training phase. The target network is periodically updated every 200 episodes. |
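
The Experiment Setup row pins down a small set of training hyperparameters. The sketch below is a minimal, illustrative way to collect those quoted values into a single Python configuration and to show the linear ε-annealing schedule they imply; the names `CONFIG` and `linear_epsilon` are illustrative assumptions, not identifiers from the authors' codebase, and only the numeric values come from the paper's text.

```python
# Hypothetical sketch of the training setup quoted in the Experiment Setup row.
# Only the numeric values are taken from the paper; all names here are illustrative.

CONFIG = {
    "gamma": 0.99,                   # discount factor
    "epsilon_start": 1.0,            # initial exploration rate
    "epsilon_finish": 0.05,          # final exploration rate
    "epsilon_anneal_steps": 50_000,  # linear annealing horizon (time-steps)
    "optimizer": "RMSProp",
    "learning_rate": 5e-3,           # as quoted in the Experiment Setup row
    "buffer_size_episodes": 5000,    # FIFO replay buffer capacity (episodes)
    "batch_size_episodes": 32,       # episodes sampled per training update
    "target_update_interval": 200,   # episodes between target-network syncs
}


def linear_epsilon(t: int, cfg: dict = CONFIG) -> float:
    """Linearly anneal epsilon from epsilon_start to epsilon_finish over
    epsilon_anneal_steps environment time-steps, then hold it constant."""
    frac = min(t / cfg["epsilon_anneal_steps"], 1.0)
    return cfg["epsilon_start"] + frac * (cfg["epsilon_finish"] - cfg["epsilon_start"])


if __name__ == "__main__":
    # Spot-check the schedule: epsilon should reach 0.05 at 50,000 steps and stay there.
    for t in (0, 25_000, 50_000, 100_000):
        print(f"t={t:>6}  epsilon={linear_epsilon(t):.3f}")
```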