Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Authors: Chao Qu, Shie Mannor, Huan Xu, Yuan Qi, Le Song, Junwu Xiong

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The goal of our experiment is two-fold: to better understand the effect of each component in the proposed algorithm, and to evaluate the efficiency of value propagation in the off-policy setting. To this end, we first do an ablation study on a simple random MDP problem, and then evaluate the performance on the cooperative navigation task [Lowe et al., 2017]. The settings of the experiment are similar to those in [Zhang et al., 2018]. Some implementation details are deferred to Appendix A.4 due to space constraints.
Researcher Affiliation | Collaboration | Chao Qu (Ant Financial Services Group), Shie Mannor (Technion), Huan Xu (Alibaba Group; Georgia Institute of Technology), Yuan Qi (Ant Financial Services Group), Le Song (Ant Financial Services Group; Georgia Institute of Technology), and Junwu Xiong (Ant Financial Services Group)
Pseudocode | Yes | Algorithm 1: Value Propagation
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper describes generating random MDPs and using a modified Cooperative Navigation task environment. It does not provide concrete access information (link, DOI, citation) for a publicly available or open dataset used for training.
Dataset Splits | No | The paper does not specify exact percentages or sample counts, nor does it reference predefined train/validation/test splits. It mentions using a replay buffer, but describes no explicit splitting methodology for validation.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models) used to run the experiments.
Software Dependencies | No | The paper mentions using ADAM for acceleration but does not specify version numbers for any software components or libraries.
Experiment Setup | Yes | In the experiment, we consider a multi-agent RL problem with N = 10 and N = 20 agents, where each agent has two actions. A discrete MDP is randomly generated with |S| = 32 states... For each agent i and each state-action pair (s, a), the reward Ri(s, a) is uniformly sampled from [0, 4]... Each agent has five actions, corresponding to moving up, down, left, or right by 0.1 units, or staying at its position. With high probability (0.95) the agent moves in the direction given by its action, and otherwise it moves in a random other direction. The maximum length of each epoch is set to 500 steps. When an agent is close enough to the landmark, e.g., the distance is less than 0.1, we consider it to have reached the target and it receives a reward of +5. When two agents are close to each other (distance less than 0.1), we treat this as a collision and each of the agents receives a penalty of -1. (Illustrative code sketches of this setup are given after the table.)
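
The sketches below are illustrative only and are not the authors' implementation. The first is a minimal construction of the randomly generated multi-agent MDP quoted in the Experiment Setup row, assuming |S| = 32 states, N = 10 agents with two actions each, and per-agent rewards R_i(s, a) drawn uniformly from [0, 4]. How the transition kernel is sampled and how joint actions are indexed is not stated in the quoted text, so the Dirichlet sampling and the joint-action encoding are assumptions.

```python
# A minimal sketch, under the stated assumptions, of a randomly generated
# multi-agent MDP with |S| = 32 states, N = 10 agents, and 2 actions per agent.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_states, n_actions = 10, 32, 2
n_joint_actions = n_actions ** n_agents          # 2^10 joint actions (assumed encoding)

# Per-agent reward tables R_i(s, a) ~ Uniform[0, 4]; whether "a" is the local
# or the joint action is not quoted, so a local action is assumed here.
rewards = rng.uniform(0.0, 4.0, size=(n_agents, n_states, n_actions))

# Transition kernel P(s' | s, a_joint): each (state, joint action) row is a
# valid distribution over next states, sampled from a symmetric Dirichlet.
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint_actions))
assert np.allclose(transitions.sum(axis=-1), 1.0)
```

The second sketch covers the movement and reward rules of the modified cooperative navigation task (a variant of the particle environment of Lowe et al. [2017]). Only the 0.1 step size, the 0.95 action-success probability, the 0.1 distance thresholds, the +5 reach reward, and the -1 collision penalty come from the quoted setup; the function names, the one-to-one landmark assignment, and the use of Euclidean distance are assumptions.

```python
# A hedged sketch of the movement and reward rules for the modified
# cooperative navigation task; names and omitted details are assumptions.
import numpy as np

STEP = 0.1            # movement unit for up/down/left/right
REACH_DIST = 0.1      # distance at which an agent is considered at its landmark
COLLIDE_DIST = 0.1    # distance at which two agents are considered to collide
MOVES = np.array([[0.0,  STEP],   # up
                  [0.0, -STEP],   # down
                  [-STEP, 0.0],   # left
                  [ STEP, 0.0],   # right
                  [0.0,  0.0]])   # stay

def move(pos, action, rng):
    """With probability 0.95 move as commanded, otherwise pick another action."""
    if rng.random() >= 0.95:
        action = rng.choice([a for a in range(len(MOVES)) if a != action])
    return pos + MOVES[action]

def rewards(positions, landmarks):
    """+5 for being within 0.1 of the assigned landmark, -1 per nearby agent."""
    r = np.zeros(len(positions))
    for i, p in enumerate(positions):
        if np.linalg.norm(p - landmarks[i]) < REACH_DIST:
            r[i] += 5.0
        for j, q in enumerate(positions):
            if j != i and np.linalg.norm(p - q) < COLLIDE_DIST:
                r[i] -= 1.0
    return r

# Example: two agents, two landmarks, one environment step.
rng = np.random.default_rng(0)
positions = np.array([[0.0, 0.0], [0.5, 0.5]])
landmarks = np.array([[0.05, 0.0], [1.0, 1.0]])
positions = np.array([move(p, a, rng) for p, a in zip(positions, [4, 0])])
print(rewards(positions, landmarks))
```

Finally, since the off-policy setting and a replay buffer are mentioned but not specified, here is a minimal replay buffer sketch; the capacity, uniform sampling, and stored fields are assumptions.

```python
# A minimal replay buffer sketch for the off-policy setting referenced above.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, joint_action, rewards, next_state):
        self.storage.append((state, joint_action, rewards, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.storage), batch_size)
```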