Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Authors: Chao Qu, Shie Mannor, Huan Xu, Yuan Qi, Le Song, Junwu Xiong

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The goal of our experiment is two-fold: to better understand the effect of each component in the proposed algorithm, and to evaluate the efficiency of value propagation in the off-policy setting. To this end, we first do an ablation study on a simple random MDP problem, and then evaluate the performance on the cooperative navigation task [Lowe et al., 2017]. The settings of the experiment are similar to those in [Zhang et al., 2018]. Some implementation details are deferred to Appendix A.4 due to space constraints.
Researcher Affiliation | Collaboration | Chao Qu (Ant Financial Services Group), Shie Mannor (Technion), Huan Xu (Alibaba Group; Georgia Institute of Technology), Yuan Qi (Ant Financial Services Group), Le Song (Ant Financial Services Group; Georgia Institute of Technology), and Junwu Xiong (Ant Financial Services Group)
Pseudocode | Yes | Algorithm 1: Value Propagation
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper describes generating random MDPs and using a modified Cooperative Navigation task environment. It does not provide concrete access information (link, DOI, citation) for a publicly available or open dataset used for training.
Dataset Splits | No | The paper does not specify exact percentages or sample counts, nor does it reference predefined train/validation/test splits. It mentions using a replay buffer, but describes no explicit splitting methodology for validation.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models) used to run the experiments.
Software Dependencies | No | The paper mentions using ADAM for acceleration but does not specify version numbers for any software components or libraries.
Experiment Setup | Yes | In the experiment, we consider a multi-agent RL problem with N = 10 and N = 20 agents, where each agent has two actions. A discrete MDP is randomly generated with |S| = 32 states... For each agent i and each state-action pair (s, a), the reward Ri(s, a) is uniformly sampled from [0, 4]... Each agent has five actions, corresponding to moving up, down, left, or right by 0.1 units, or staying at its position. With high probability (0.95) the agent moves in the direction given by its action, and otherwise it moves in a random other direction. The maximum length of each epoch is set to 500 steps. When an agent is close enough to the landmark, e.g., the distance is less than 0.1, we consider it to have reached the target and it receives a reward of +5. When two agents are close to each other (distance less than 0.1), we treat this as a collision and each of the agents receives a penalty of -1. (Illustrative code sketches of this setup are given after the table.)
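
The sketches below are illustrative only and are not the authors' implementation. The first is a minimal construction of the randomly generated multi-agent MDP quoted in the Experiment Setup row, assuming |S| = 32 states, N = 10 agents with two actions each, and per-agent rewards R_i(s, a) drawn uniformly from [0, 4]. How the transition kernel is sampled and how joint actions are indexed is not stated in the quoted text, so the Dirichlet sampling and the joint-action encoding are assumptions.

```python
# A minimal sketch, under the stated assumptions, of a randomly generated
# multi-agent MDP with |S| = 32 states, N = 10 agents, and 2 actions per agent.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_states, n_actions = 10, 32, 2
n_joint_actions = n_actions ** n_agents          # 2^10 joint actions (assumed encoding)

# Per-agent reward tables R_i(s, a) ~ Uniform[0, 4]; whether "a" is the local
# or the joint action is not quoted, so a local action is assumed here.
rewards = rng.uniform(0.0, 4.0, size=(n_agents, n_states, n_actions))

# Transition kernel P(s' | s, a_joint): each (state, joint action) row is a
# valid distribution over next states, sampled from a symmetric Dirichlet.
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint_actions))
assert np.allclose(transitions.sum(axis=-1), 1.0)
```

The second sketch covers the movement and reward rules of the modified cooperative navigation task (a variant of the particle environment of Lowe et al. [2017]). Only the 0.1 step size, the 0.95 action-success probability, the 0.1 distance thresholds, the +5 reach reward, and the -1 collision penalty come from the quoted setup; the function names, the one-to-one landmark assignment, and the use of Euclidean distance are assumptions.

```python
# A hedged sketch of the movement and reward rules for the modified
# cooperative navigation task; names and omitted details are assumptions.
import numpy as np

STEP = 0.1            # movement unit for up/down/left/right
REACH_DIST = 0.1      # distance at which an agent is considered at its landmark
COLLIDE_DIST = 0.1    # distance at which two agents are considered to collide
MOVES = np.array([[0.0,  STEP],   # up
                  [0.0, -STEP],   # down
                  [-STEP, 0.0],   # left
                  [ STEP, 0.0],   # right
                  [0.0,  0.0]])   # stay

def move(pos, action, rng):
    """With probability 0.95 move as commanded, otherwise pick another action."""
    if rng.random() >= 0.95:
        action = rng.choice([a for a in range(len(MOVES)) if a != action])
    return pos + MOVES[action]

def rewards(positions, landmarks):
    """+5 for being within 0.1 of the assigned landmark, -1 per nearby agent."""
    r = np.zeros(len(positions))
    for i, p in enumerate(positions):
        if np.linalg.norm(p - landmarks[i]) < REACH_DIST:
            r[i] += 5.0
        for j, q in enumerate(positions):
            if j != i and np.linalg.norm(p - q) < COLLIDE_DIST:
                r[i] -= 1.0
    return r

# Example: two agents, two landmarks, one environment step.
rng = np.random.default_rng(0)
positions = np.array([[0.0, 0.0], [0.5, 0.5]])
landmarks = np.array([[0.05, 0.0], [1.0, 1.0]])
positions = np.array([move(p, a, rng) for p, a in zip(positions, [4, 0])])
print(rewards(positions, landmarks))
```

Finally, since the off-policy setting and a replay buffer are mentioned but not specified, here is a minimal replay buffer sketch; the capacity, uniform sampling, and stored fields are assumptions.

```python
# A minimal replay buffer sketch for the off-policy setting referenced above.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, joint_action, rewards, next_state):
        self.storage.append((state, joint_action, rewards, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.storage), batch_size)
```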