Q-value Path Decomposition for Deep Multiagent Reinforcement Learning
Authors: Yaodong Yang, Jianye Hao, Guangyong Chen, Hongyao Tang, Yingfeng Chen, Yujing Hu, Changjie Fan, Zhongyu Wei
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that QPD achieves the state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms. |
| Researcher Affiliation | Collaboration | 1 College of Intelligence and Computing, Tianjin University; 2 Huawei Noah's Ark Lab; 3 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; 4 Tencent Quantum Lab; 5 NetEase Fuxi AI Lab; 6 Fudan University. |
| Pseudocode | Yes | Algorithm 1 Q-value Path Decomposition algorithm |
| Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019) as our testbed. |
| Dataset Splits | No | The paper describes training and testing episodes but does not explicitly mention a separate validation dataset split with specific percentages or counts. |
| Hardware Specification | No | The paper describes the network architectures and training configurations but does not provide specific details on the hardware used (e.g., GPU/CPU models, memory specifications) for running experiments. |
| Software Dependencies | No | The paper mentions optimizers (RMSprop, Adam) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries). |
| Experiment Setup | Yes | The architecture of the agent Q-networks is a DRQN with an LSTM layer with a 64-dimensional hidden state, followed by a fully-connected layer and a final fully-connected layer with |A| outputs. The input to the agent networks is sequential data consisting of each agent's local observations over the latest 12 time steps for all scenarios. The architecture of the QPD critic is a feedforward neural network whose first two dense layers have 64 units for each channel; the channel outputs are then concatenated or added within each group, and finally concatenated into an output layer with one unit. We set γ to 0.99. To speed up learning, we share parameters across all individual Q-networks, and a one-hot encoding of the agent type is concatenated onto each agent's observations to allow the learning of diverse behaviors. All agent networks are trained using RMSprop with a learning rate of 5 × 10⁻⁴, and the critic is trained with Adam with the same learning rate. The replay buffer contains the most recent 1000 trajectories and the batch size is 32. Target networks for the global critic are updated after every 200 training episodes. |
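
The testbed cited in the Open Datasets row is the StarCraft Multi-Agent Challenge (SMAC). As a hedged illustration of how that environment is typically driven, the sketch below uses the public `smac` package (`StarCraft2Env`) with a random policy; the map name and the random action selection are placeholder assumptions and do not reflect the paper's evaluation protocol.

```python
import numpy as np
from smac.env import StarCraft2Env

# Map name is an illustrative choice, not necessarily one used in the paper.
env = StarCraft2Env(map_name="3m")
env_info = env.get_env_info()
n_agents = env_info["n_agents"]

env.reset()
terminated = False
episode_reward = 0.0
while not terminated:
    obs = env.get_obs()      # list of per-agent local observations
    state = env.get_state()  # global state, available to a centralized critic
    actions = []
    for agent_id in range(n_agents):
        # Sample uniformly among the actions SMAC marks as available.
        avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
        actions.append(np.random.choice(avail))
    reward, terminated, info = env.step(actions)
    episode_reward += reward
env.close()
```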
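
The Experiment Setup row quotes the agent-network architecture and training hyperparameters. Below is a minimal PyTorch sketch of how those pieces might be wired together, assuming the stated sizes (64-dimensional LSTM hidden state, |A| outputs, a 12-step observation window, RMSprop and Adam at 5 × 10⁻⁴). The class and variable names, observation and action dimensions, and the exact layer arrangement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent Q-network sketch: LSTM with a 64-dimensional hidden state,
    a fully-connected layer, and an |A|-dimensional output layer.
    Parameters are assumed to be shared across agents, with a one-hot
    agent-type vector concatenated to each observation upstream."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, 12, obs_dim) -- the latest 12 time steps of local observations
        out, _ = self.lstm(obs_seq)
        return self.head(out[:, -1])  # Q-values at the most recent step

# Hyperparameters quoted in the table; the dictionary keys are illustrative.
CONFIG = dict(
    gamma=0.99,
    agent_lr=5e-4,              # RMSprop for the agent Q-networks
    critic_lr=5e-4,             # Adam for the QPD critic
    buffer_size=1000,           # most recent trajectories kept in the replay buffer
    batch_size=32,
    target_update_episodes=200, # target-critic update interval
    obs_window=12,
)

# Observation and action dimensions below are placeholders.
agent_net = DRQNAgent(obs_dim=80, n_actions=14)
agent_opt = torch.optim.RMSprop(agent_net.parameters(), lr=CONFIG["agent_lr"])
```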