Self-Organized Polynomial-Time Coordination Graphs
Authors: Qianlan Yang, Weijun Dong, Zhizhou Ren, Jianhao Wang, Tonghan Wang, Chongjie Zhang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we show that our approach learns succinct and well-adapted graph topologies, induces effective coordination, and improves performance across a variety of cooperative multi-agent tasks. In this section, we conduct experiments to answer the following questions: (1) Does the accurate greedy action selection improve our performance (see Section 5.1 and 5.2)? (2) Is the dynamic graph organization mechanism necessary for our algorithm (see Section 5.1 and 5.3)? (3) How well does SOP-CG perform on complex cooperative multi-agent tasks (see Fig. 4, 5 and 6)? (4) Can SOP-CG extract interpretable dynamic coordination structures in complex scenarios (see Section 5.4)? |
| Researcher Affiliation | Academia | ¹Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; ²Department of Computer Science, University of Illinois at Urbana-Champaign; ³Harvard University. |
| Pseudocode | Yes | Algorithm 1 Self-Organized Polynomial-Time Coordination Graphs |
| Open Source Code | Yes | An open-source implementation of our algorithm is available online at https://github.com/yanQval/SOP-CG. |
| Open Datasets | Yes | In experiments, we evaluate SOP-CG on the MACO benchmark (Wang et al., 2021b), particle environment (Lowe et al., 2017) and StarCraft II (Samvelyan et al., 2019). |
| Dataset Splits | No | The paper mentions using a replay buffer and evaluating on 'test return' but does not provide explicit details about training, validation, or test dataset splits. |
| Hardware Specification | Yes | The experiments are finished on NVIDIA RTX 2080TI GPU. |
| Software Dependencies | No | The paper mentions software components like GRU, RMSProp, and PyMARL but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | All tasks in this paper use a discount factor γ = 0.99. We use ϵ-greedy exploration, and ϵ anneals linearly from 1.0 to 0.05 over 50000 time-steps. We use an RMSProp optimizer with a learning rate of 5e-3 to train our network. A first-in-first-out (FIFO) replay buffer stores the experiences of at most 5000 episodes, and a batch of 32 episodes are sampled from the buffer during the training phase. The target network is periodically updated every 200 episodes. |
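
The Experiment Setup row pins down a small set of training hyperparameters. The sketch below is a minimal, illustrative way to collect those quoted values into a single Python configuration and to show the linear ε-annealing schedule they imply; the names `CONFIG` and `linear_epsilon` are illustrative assumptions, not identifiers from the authors' codebase, and only the numeric values come from the paper's text.

```python
# Hypothetical sketch of the training setup quoted in the Experiment Setup row.
# Only the numeric values are taken from the paper; all names here are illustrative.

CONFIG = {
    "gamma": 0.99,                   # discount factor
    "epsilon_start": 1.0,            # initial exploration rate
    "epsilon_finish": 0.05,          # final exploration rate
    "epsilon_anneal_steps": 50_000,  # linear annealing horizon (time-steps)
    "optimizer": "RMSProp",
    "learning_rate": 5e-3,           # as quoted in the Experiment Setup row
    "buffer_size_episodes": 5000,    # FIFO replay buffer capacity (episodes)
    "batch_size_episodes": 32,       # episodes sampled per training update
    "target_update_interval": 200,   # episodes between target-network syncs
}


def linear_epsilon(t: int, cfg: dict = CONFIG) -> float:
    """Linearly anneal epsilon from epsilon_start to epsilon_finish over
    epsilon_anneal_steps environment time-steps, then hold it constant."""
    frac = min(t / cfg["epsilon_anneal_steps"], 1.0)
    return cfg["epsilon_start"] + frac * (cfg["epsilon_finish"] - cfg["epsilon_start"])


if __name__ == "__main__":
    # Spot-check the schedule: epsilon should reach 0.05 at 50,000 steps and stay there.
    for t in (0, 25_000, 50_000, 100_000):
        print(f"t={t:>6}  epsilon={linear_epsilon(t):.3f}")
```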