QPLEX: Duplex Dueling Multi-Agent Q-Learning

Authors: Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, Chongjie Zhang

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.
Researcher Affiliation | Collaboration | Jianhao Wang¹, Zhizhou Ren¹, Terry Liu¹, Yang Yu², Chongjie Zhang¹; ¹Institute for Interdisciplinary Information Sciences, Tsinghua University, China; ²Polixir Technologies, China
Pseudocode | No | No explicit pseudocode or algorithm block was found in the paper.
Open Source Code | No | The paper mentions 'Videos available at https://sites.google.com/view/qplex-marl/' but provides no explicit statement of, or link to, open-source code for the QPLEX method itself.
Open Datasets | Yes | Empirical results on more challenging StarCraft II tasks show that QPLEX significantly outperforms other multi-agent Q-learning baselines in online and offline data collections.
Dataset Splits | No | The paper mentions collecting 2 million timesteps of data and evaluating test win rate, and for offline data collection '20k or 50k experienced episodes', but it does not specify explicit training, validation, and test splits with percentages or sample counts.
Hardware Specification | Yes | Our training time on an NVIDIA RTX 2080TI GPU of each task is about 6 hours to 20 hours, depending on the agent number and the episode length limit of each map.
Software Dependencies | No | The paper mentions using the 'PyMARL (Samvelyan et al., 2019) implementation' for the baselines and QPLEX, but no specific version numbers for PyMARL or other software dependencies (e.g., Python, PyTorch) are provided.
Experiment Setup | Yes | We use ϵ-greedy exploration and a limited first-in-first-out (FIFO) replay buffer of size 5000 episodes, where ϵ is linearly annealed from 1.0 to 0.05 over 50k timesteps and kept constant for the rest of the training process. To utilize the training buffer more efficiently, we perform gradient updates twice with a batch of 32 episodes after collecting every episode for each algorithm.
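
A minimal sketch of the exploration schedule quoted above, assuming a per-timestep linear decay; the function name and signature are illustrative, not from the paper's code:

```python
def epsilon_at(t: int, eps_start: float = 1.0, eps_end: float = 0.05,
               anneal_steps: int = 50_000) -> float:
    """Linear epsilon-greedy schedule as described in the quoted setup:
    anneal from 1.0 to 0.05 over the first 50k environment timesteps,
    then hold epsilon constant for the rest of training."""
    if t >= anneal_steps:
        return eps_end
    return eps_start + (t / anneal_steps) * (eps_end - eps_start)


# Example: epsilon_at(0) -> 1.0, epsilon_at(25_000) -> 0.525, epsilon_at(60_000) -> 0.05
```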