QPLEX: Duplex Dueling Multi-Agent Q-Learning
Authors: Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, Chongjie Zhang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration. |
| Researcher Affiliation | Collaboration | Jianhao Wang¹, Zhizhou Ren¹, Terry Liu¹, Yang Yu², Chongjie Zhang¹; ¹Institute for Interdisciplinary Information Sciences, Tsinghua University, China; ²Polixir Technologies, China |
| Pseudocode | No | No explicit pseudocode or algorithm block was found in the paper. |
| Open Source Code | No | The paper mentions 'Videos available at https://sites.google.com/view/qplex-marl/' but provides no explicit statement about, or link to, open-source code for the QPLEX method itself. |
| Open Datasets | Yes | Empirical results on more challenging StarCraft II tasks show that QPLEX significantly outperforms other multi-agent Q-learning baselines in online and offline data collections. |
| Dataset Splits | No | The paper mentions collecting 2 million timesteps of data and evaluating test win rate, and for offline collection, '20k or 50k experienced episodes', but does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Our training time on an NVIDIA RTX 2080TI GPU of each task is about 6 hours to 20 hours, depending on the agent number and the episode length limit of each map. |
| Software Dependencies | No | The paper mentions using the 'PyMARL (Samvelyan et al., 2019) implementation' for the baselines and QPLEX, but no specific version numbers for PyMARL or other software dependencies such as Python or PyTorch are provided. |
| Experiment Setup | Yes | We use ϵ-greedy exploration and a limited first-in-first-out (FIFO) replay buffer of size 5000 episodes, where ϵ is linearly annealed from 1.0 to 0.05 over 50k timesteps and then kept constant for the rest of the training process. To utilize the training buffer more efficiently, we perform gradient updates twice with a batch of 32 episodes after collecting each episode for each algorithm. A minimal sketch of this schedule appears below the table. |
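
The reported setup reduces to a small set of exploration, buffer, and update settings. The sketch below illustrates them in plain Python, assuming hypothetical helper names (`after_episode`, `learner.train`) and uniform batch sampling; it is not the authors' PyMARL configuration, only a reading of the quoted description.

```python
import random
from collections import deque

# Settings quoted in the experiment setup; names here are illustrative.
EPS_START, EPS_FINISH, ANNEAL_TIME = 1.0, 0.05, 50_000   # linear annealing over 50k timesteps
BUFFER_SIZE, BATCH_SIZE, UPDATES_PER_EPISODE = 5000, 32, 2


def epsilon(t_env: int) -> float:
    """Linearly anneal epsilon from 1.0 to 0.05 over 50k timesteps, then hold it constant."""
    frac = min(t_env / ANNEAL_TIME, 1.0)
    return EPS_START + frac * (EPS_FINISH - EPS_START)


# FIFO replay buffer capped at 5000 episodes (oldest episodes are evicted first).
replay_buffer: deque = deque(maxlen=BUFFER_SIZE)


def after_episode(episode, learner, t_env: int) -> None:
    """Store a finished episode, then do two gradient updates on 32-episode batches."""
    replay_buffer.append(episode)
    if len(replay_buffer) >= BATCH_SIZE:
        for _ in range(UPDATES_PER_EPISODE):
            # Uniform sampling is an assumption; `learner.train` is a hypothetical API.
            batch = random.sample(list(replay_buffer), BATCH_SIZE)
            learner.train(batch, t_env)
```

Reading the schedule at a few points makes the annealing concrete: `epsilon(0)` is 1.0, `epsilon(25_000)` is 0.525, and `epsilon(50_000)` onward stays at 0.05.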