Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning

Authors: Woojun Kim, Myungsik Cho, Youngchul Sung (pp. 6079-6086)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed message-dropout technique for several games, and numerical results show that the proposed message-dropout technique with proper dropout rate improves the reinforcement learning performance significantly in terms of the training speed and the steady-state performance in the execution phase.
Researcher Affiliation | Academia | Woojun Kim, Myungsik Cho, Youngchul Sung; School of Electrical Engineering, KAIST, Korea; {woojun.kim, ms.cho, ycsung}@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: DCC with Message-Dropout (DCC-MD); an illustrative sketch of the message-dropout step is given below the table.
Open Source Code | No | The paper does not provide any concrete access to source code, such as a repository link or an explicit statement about code release in supplementary materials.
Open Datasets | Yes | The pursuit game is a standard task for multi-agent systems (Vidal et al. 2002). The environment is a two-dimensional grid with N pursuers and M evaders. The goal of the game is to capture all evaders as quickly as possible by training the agents (i.e., the pursuers). Initially, all evaders are at the center of the grid, and each evader randomly and independently chooses one of five actions at each time step: move North, East, West, South, or Stay. (An evader stays if a pursuer or a map boundary occupies the position it would move to.) Each pursuer starts at a random position on the map and has the same five actions: move North, East, West, South, or Stay. When the four sides of an evader are surrounded by pursuers or map boundaries, the evader is removed and the pursuers who capture it receive a reward R+. All pursuers receive a reward R1 at each time step and a reward R2 if the pursuer hits the map boundary (the latter negative reward is to promote exploration). An episode ends when all evaders are captured or T time steps elapse. As in (Gupta, Egorov, and Kochenderfer 2017), each pursuer observes its surroundings, which consist of map boundary, evader(s), and other pursuer(s). Each pursuer can observe up to distance D in the four directions, so the observed information of each pursuer can be represented by a 3 × (2D+1) × (2D+1) observation window (which is the observation of each agent): a (2D+1) × (2D+1) window detecting other pursuer(s), a (2D+1) × (2D+1) window detecting evader(s), and a (2D+1) × (2D+1) window detecting the map boundary. For the game of pursuit, we set R+ = 5, R1 = -0.05, R2 = -0.5, T = 500, M = 2, and D = 3, and simulate two cases: N = 6 and N = 8. The map sizes of the two cases are 15 × 15 and 17 × 17, respectively.
Dataset Splits | No | The paper refers to a 'training phase' and an 'execution phase' but does not explicitly mention a 'validation set' or define specific training/validation/test splits as commonly done in supervised learning contexts. Performance evaluation is described as 'after training'.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions general learning frameworks such as DQN and MADDPG but does not specify any software libraries or dependencies with version numbers.
Experiment Setup | Yes | For the game of pursuit, we set R+ = 5, R1 = -0.05, R2 = -0.5, T = 500, M = 2, and D = 3, and simulate two cases: N = 6 and N = 8 (these settings are collected in the configuration sketch below the table).
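
No source code is released for the DCC-MD algorithm referenced in the Pseudocode row, so the following PyTorch-style sketch is only an illustration of the core message-dropout idea: during training, each incoming message from another agent is dropped block-wise (all of its units at once) with probability p, and the kept messages are rescaled inverted-dropout style so that nothing changes at execution time. The class names, layer sizes, and rescaling choice are assumptions made for this sketch, not details taken from the paper.

import torch
import torch.nn as nn


class MessageDropout(nn.Module):
    """Drops each incoming message vector entirely with probability p."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        assert 0.0 <= p < 1.0
        self.p = p

    def forward(self, messages: torch.Tensor) -> torch.Tensor:
        # messages: (batch, n_other_agents, msg_dim)
        if not self.training or self.p == 0.0:
            return messages
        # Sample one keep/drop decision per message block, not per unit.
        keep = (torch.rand(messages.shape[:2], device=messages.device) > self.p).float()
        # Inverted-dropout rescaling keeps the expected input scale unchanged;
        # the paper's exact weight compensation may differ.
        return messages * keep.unsqueeze(-1) / (1.0 - self.p)


class DecentralizedQNet(nn.Module):
    """Per-agent Q-network fed with its own observation plus (dropped) messages."""

    def __init__(self, obs_dim, msg_dim, n_other, n_actions, p=0.5):
        super().__init__()
        self.md = MessageDropout(p)
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_other * msg_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs, messages):
        messages = self.md(messages)                 # block-wise message-dropout
        x = torch.cat([obs, messages.flatten(1)], dim=-1)
        return self.net(x)

At execution time, calling .eval() on the network disables the dropout, so all messages pass through unchanged, following the usual dropout convention of rescaling during training only.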
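
The pursuit-game description and settings quoted in the Open Datasets and Experiment Setup rows can be collected into a small configuration sketch. The dataclass below is illustrative only: the field names are invented for this summary, and the negative signs on the per-step and boundary rewards follow the quote's statement that these are negative rewards. It also shows how D = 3 gives the 3 × (2D+1) × (2D+1) = 3 × 7 × 7 observation window.

from dataclasses import dataclass


@dataclass
class PursuitConfig:
    n_pursuers: int            # N: 6 or 8 in the paper's two cases
    n_evaders: int = 2         # M
    obs_range: int = 3         # D: observable distance in each of the four directions
    max_steps: int = 500       # T: episode length limit
    r_capture: float = 5.0     # R+ for capturing an evader
    r_time: float = -0.05      # per-time-step reward (sign assumed negative)
    r_boundary: float = -0.5   # reward for hitting the map boundary (sign assumed negative)

    @property
    def obs_shape(self):
        # 3 channels (other pursuers, evaders, map boundary) over a
        # (2D+1) x (2D+1) window centered on the agent.
        w = 2 * self.obs_range + 1
        return (3, w, w)


# The two simulated cases: N = 6 on a 15 x 15 map and N = 8 on a 17 x 17 map.
for n_pursuers, map_size in [(6, 15), (8, 17)]:
    cfg = PursuitConfig(n_pursuers=n_pursuers)
    print(n_pursuers, map_size, cfg.obs_shape)   # -> (3, 7, 7) for D = 3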