Multi-Agent Interactions Modeling with Correlated Policies

Authors: Minghuan Liu, Ming Zhou, Weinan Zhang, Yuzheng Zhuang, Jun Wang, Wulong Liu, Yong Yu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Various experiments demonstrate that CoDAIL can better regenerate complex interactions close to the demonstrators and outperforms state-of-the-art multi-agent imitation learning methods.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, 2 Huawei Noah's Ark Lab; {minghuanliu, mingak, wnzhang, yyu}@sjtu.edu.cn, {zhuangyuzheng, w.j, liuwulong}@huawei.com
Pseudocode | Yes | We name our algorithm as Decentralized Adversarial Imitation Learning with Correlated policies (Correlated DAIL, a.k.a. CoDAIL) and present the training procedure in Appendix Algo. 1, which can be easily scaled to a distributed algorithm. [Appendix A.1 CoDAIL Algorithm, Algorithm 1 CoDAIL Algorithm] (A hedged sketch of this training loop is given after the table.)
Open Source Code | Yes | Our code is available at https://github.com/apexrl/CoDAIL.
Open Datasets | Yes | We test our method on the Particle World Environments (Lowe et al., 2017), which is a popular benchmark for evaluating multi-agent algorithms. (An environment-setup sketch is given after the table.)
Dataset Splits | No | The paper mentions using "200 episodes of demonstrations" for training but does not provide specific percentages or counts for training, validation, or test splits. It describes how the demonstrator policies and the imitating agents were trained, but not how the demonstration data itself was split for the imitation learning process.
Hardware Specification | No | The paper describes the model architecture (e.g., "two layer MLPs with 128 cells") and training parameters, but it does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for running the experiments.
Software Dependencies | No | The paper mentions using the "K-FAC optimizer (Martens & Grosse, 2015)" and refers to modifications of "ACKTR (Wu et al., 2017)", but it does not provide specific version numbers for these or other software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | During our experiments, we use two layer MLPs with 128 cells in each layer for policy networks, value networks, discriminator networks, and opponents model networks on all scenarios. The batch size is set to 1000. The policy is trained using the K-FAC optimizer (Martens & Grosse, 2015) with a learning rate of 0.1 and a small λ of 0.05. All other parameters for the K-FAC optimizer are the same as in Wu et al. (2017). We train each algorithm for 55000 epochs with 5 random seeds to obtain its average performance on all environments. [Appendix B, D] (The reported hyperparameters are collected in the configuration sketch after the table.)
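
To make the Pseudocode row more concrete, here is a minimal Python sketch of a CoDAIL-style decentralized training loop, based only on the high-level description quoted from the paper: each agent keeps a correlated policy, an opponent model, and a discriminator, and alternates their updates. Every name in it (`collect_rollout`, `agent.discriminator`, `agent.policy`, `agent.opponent_model`, the batch helpers) is a hypothetical stand-in rather than the authors' released API; Algorithm 1 in the paper's Appendix A.1 gives the exact procedure.

```python
# Minimal sketch of a CoDAIL-style training loop (cf. Algorithm 1, Appendix A.1).
# All helper objects and functions here are hypothetical stand-ins.

def train_codail(env, demos, agents, n_epochs, batch_size=1000):
    """Decentralized adversarial imitation learning with correlated policies."""
    for epoch in range(n_epochs):
        # Each agent i acts with a correlated policy pi_i(a_i | o_i, a_-i),
        # where opponents' actions a_-i come from its learned opponent model.
        batch = collect_rollout(env, agents, batch_size)

        for i, agent in enumerate(agents):
            # 1) Discriminator update: separate demonstrator (o, a_i, a_-i)
            #    samples from those generated by the current policies.
            agent.discriminator.update(expert=demos.sample(i),
                                       generated=batch.for_agent(i))

            # 2) Policy update: use a discriminator-based score as the
            #    imitation reward and take an actor-critic step (the paper
            #    uses ACKTR with a K-FAC optimizer; any policy-gradient
            #    method fits this sketch).
            rewards = agent.discriminator.imitation_reward(batch.for_agent(i))
            agent.policy.update(batch.for_agent(i), rewards)

            # 3) Opponent-model update: regress the opponents' observed
            #    actions so the correlated policy can condition on them.
            agent.opponent_model.update(batch.opponent_actions(i))
    return agents
```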
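
The Particle World Environments cited in the Open Datasets row are distributed as OpenAI's multiagent-particle-envs package. Below is a minimal loading sketch, assuming that package (plus numpy and gym) is installed; the scenario name `simple_tag` is only an illustrative choice and does not indicate which scenarios the paper evaluated.

```python
# Minimal sketch for loading one Particle World scenario, assuming the
# OpenAI multiagent-particle-envs package is installed. The scenario name
# below is illustrative only.
import numpy as np
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios

scenario = scenarios.load("simple_tag.py").Scenario()
world = scenario.make_world()
env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)

obs_n = env.reset()  # one observation per agent
# Random one-hot actions (valid for movement-only scenarios with Discrete spaces).
act_n = [np.eye(space.n)[np.random.randint(space.n)] for space in env.action_space]
obs_n, rew_n, done_n, info_n = env.step(act_n)
```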
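
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration plus a network constructor, sketched below. The use of PyTorch and the tanh activation are assumptions made for illustration; as the Software Dependencies row notes, the paper does not name a framework.

```python
import torch.nn as nn

# Hyperparameters reported in the paper (Appendix B, D).
HYPERPARAMS = {
    "hidden_layers": 2,
    "hidden_units": 128,
    "batch_size": 1000,
    "optimizer": "K-FAC",   # Martens & Grosse (2015), used as in ACKTR (Wu et al., 2017)
    "learning_rate": 0.1,
    "kfac_lambda": 0.05,
    "epochs": 55000,
    "random_seeds": 5,
}

def make_mlp(in_dim: int, out_dim: int) -> nn.Module:
    """Two-layer MLP with 128 units per hidden layer, the architecture the
    paper reports for policy, value, discriminator, and opponent-model
    networks. The framework (PyTorch) and activation (tanh) are assumptions."""
    h = HYPERPARAMS["hidden_units"]
    return nn.Sequential(
        nn.Linear(in_dim, h), nn.Tanh(),
        nn.Linear(h, h), nn.Tanh(),
        nn.Linear(h, out_dim),
    )
```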