Q-value Path Decomposition for Deep Multiagent Reinforcement Learning
Authors: Yaodong Yang, Jianye Hao, Guangyong Chen, Hongyao Tang, Yingfeng Chen, Yujing Hu, Changjie Fan, Zhongyu Wei
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that QPD achieves the state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms. |
| Researcher Affiliation | Collaboration | 1 College of Intelligence and Computing, Tianjin University; 2 Huawei Noah's Ark Lab; 3 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; 4 Tencent Quantum Lab; 5 NetEase Fuxi AI Lab; 6 Fudan University. |
| Pseudocode | Yes | Algorithm 1 Q-value Path Decomposition algorithm |
| Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019) as our testbed. |
| Dataset Splits | No | The paper describes training and testing episodes but does not explicitly mention a separate validation dataset split with specific percentages or counts. |
| Hardware Specification | No | The paper describes the network architectures and training configurations but does not provide specific details on the hardware used (e.g., GPU/CPU models, memory specifications) for running experiments. |
| Software Dependencies | No | The paper mentions optimizers (RMSprop, Adam) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries). |
| Experiment Setup | Yes | The architecture of the agent Q-networks is a DRQN with an LSTM layer with a 64-dimensional hidden state, followed by a fully-connected layer and a final fully-connected layer with |A| outputs. The input to the agent networks is sequential data consisting of each agent's local observations over the latest 12 time steps for all scenarios. The architecture of the QPD critic is a feedforward neural network whose first two dense layers have 64 units for each channel; the channel outputs are then concatenated or added within each group, and finally concatenated into an output layer with one unit. We set γ to 0.99. To speed up learning, we share parameters across all individual Q-networks, and a one-hot encoding of the agent type is concatenated onto each agent's observations to allow the learning of diverse behaviors. All agent networks are trained using RMSprop with a learning rate of 5 × 10⁻⁴, and the critic is trained with Adam with the same learning rate. The replay buffer contains the most recent 1000 trajectories and the batch size is 32. Target networks for the global critic are updated after every 200 training episodes. |
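
The testbed cited in the Open Datasets row is the StarCraft Multi-Agent Challenge (SMAC). As a hedged illustration of how that environment is typically driven, the sketch below uses the public `smac` package (`StarCraft2Env`) with a random policy; the map name and the random action selection are placeholder assumptions and do not reflect the paper's evaluation protocol.

```python
import numpy as np
from smac.env import StarCraft2Env

# Map name is an illustrative choice, not necessarily one used in the paper.
env = StarCraft2Env(map_name="3m")
env_info = env.get_env_info()
n_agents = env_info["n_agents"]

env.reset()
terminated = False
episode_reward = 0.0
while not terminated:
    obs = env.get_obs()      # list of per-agent local observations
    state = env.get_state()  # global state, available to a centralized critic
    actions = []
    for agent_id in range(n_agents):
        # Sample uniformly among the actions SMAC marks as available.
        avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
        actions.append(np.random.choice(avail))
    reward, terminated, info = env.step(actions)
    episode_reward += reward
env.close()
```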
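
The Experiment Setup row quotes the agent-network architecture and training hyperparameters. Below is a minimal PyTorch sketch of how those pieces might be wired together, assuming the stated sizes (64-dimensional LSTM hidden state, |A| outputs, a 12-step observation window, RMSprop and Adam at 5 × 10⁻⁴). The class and variable names, observation and action dimensions, and the exact layer arrangement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent Q-network sketch: LSTM with a 64-dimensional hidden state,
    a fully-connected layer, and an |A|-dimensional output layer.
    Parameters are assumed to be shared across agents, with a one-hot
    agent-type vector concatenated to each observation upstream."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, 12, obs_dim) -- the latest 12 time steps of local observations
        out, _ = self.lstm(obs_seq)
        return self.head(out[:, -1])  # Q-values at the most recent step

# Hyperparameters quoted in the table; the dictionary keys are illustrative.
CONFIG = dict(
    gamma=0.99,
    agent_lr=5e-4,              # RMSprop for the agent Q-networks
    critic_lr=5e-4,             # Adam for the QPD critic
    buffer_size=1000,           # most recent trajectories kept in the replay buffer
    batch_size=32,
    target_update_episodes=200, # target-critic update interval
    obs_window=12,
)

# Observation and action dimensions below are placeholders.
agent_net = DRQNAgent(obs_dim=80, n_actions=14)
agent_opt = torch.optim.RMSprop(agent_net.parameters(), lr=CONFIG["agent_lr"])
```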