QPLEX: Duplex Dueling Multi-Agent Q-Learning
Authors: Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, Chongjie Zhang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration. |
| Researcher Affiliation | Collaboration | Jianhao Wang¹, Zhizhou Ren¹, Terry Liu¹, Yang Yu², Chongjie Zhang¹; ¹Institute for Interdisciplinary Information Sciences, Tsinghua University, China; ²Polixir Technologies, China |
| Pseudocode | No | No explicit pseudocode or algorithm block was found in the paper. |
| Open Source Code | No | The paper mentions 'Videos available at https://sites.google.com/view/qplex-marl/' but provides no explicit statement about, or link to, open-source code for the QPLEX method itself. |
| Open Datasets | Yes | Empirical results on more challenging StarCraft II tasks show that QPLEX significantly outperforms other multi-agent Q-learning baselines in online and offline data collections. |
| Dataset Splits | No | The paper mentions collecting 2 million timesteps of data and evaluating test win rate, and for offline collection, '20k or 50k experienced episodes', but does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Our training time on an NVIDIA RTX 2080TI GPU of each task is about 6 hours to 20 hours, depending on the agent number and the episode length limit of each map. |
| Software Dependencies | No | The paper mentions using the 'PyMARL (Samvelyan et al., 2019) implementation' for the baselines and QPLEX, but no specific version numbers for PyMARL or other software dependencies such as Python or PyTorch are provided. |
| Experiment Setup | Yes | We use ϵ-greedy exploration and a limited first-in-first-out (FIFO) replay buffer of size 5000 episodes, where ϵ is linearly annealed from 1.0 to 0.05 over 50k timesteps and then kept constant for the rest of the training process. To utilize the training buffer more efficiently, we perform gradient updates twice with a batch of 32 episodes after collecting each episode for each algorithm. A minimal sketch of this schedule appears below the table. |
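
The reported setup reduces to a small set of exploration, buffer, and update settings. The sketch below illustrates them in plain Python, assuming hypothetical helper names (`after_episode`, `learner.train`) and uniform batch sampling; it is not the authors' PyMARL configuration, only a reading of the quoted description.

```python
import random
from collections import deque

# Settings quoted in the experiment setup; names here are illustrative.
EPS_START, EPS_FINISH, ANNEAL_TIME = 1.0, 0.05, 50_000   # linear annealing over 50k timesteps
BUFFER_SIZE, BATCH_SIZE, UPDATES_PER_EPISODE = 5000, 32, 2


def epsilon(t_env: int) -> float:
    """Linearly anneal epsilon from 1.0 to 0.05 over 50k timesteps, then hold it constant."""
    frac = min(t_env / ANNEAL_TIME, 1.0)
    return EPS_START + frac * (EPS_FINISH - EPS_START)


# FIFO replay buffer capped at 5000 episodes (oldest episodes are evicted first).
replay_buffer: deque = deque(maxlen=BUFFER_SIZE)


def after_episode(episode, learner, t_env: int) -> None:
    """Store a finished episode, then do two gradient updates on 32-episode batches."""
    replay_buffer.append(episode)
    if len(replay_buffer) >= BATCH_SIZE:
        for _ in range(UPDATES_PER_EPISODE):
            # Uniform sampling is an assumption; `learner.train` is a hypothetical API.
            batch = random.sample(list(replay_buffer), BATCH_SIZE)
            learner.train(batch, t_env)
```

Reading the schedule at a few points makes the annealing concrete: `epsilon(0)` is 1.0, `epsilon(25_000)` is 0.525, and `epsilon(50_000)` onward stays at 0.05.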