Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning
Authors: Woojun Kim, Myungsik Cho, Youngchul Sung
AAAI 2019, pp. 6079-6086 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed message-dropout technique for several games, and numerical results show that the proposed message-dropout technique with proper dropout rate improves the reinforcement learning performance significantly in terms of the training speed and the steady-state performance in the execution phase. |
| Researcher Affiliation | Academia | Woojun Kim, Myungsik Cho, Youngchul Sung School of Electrical Engineering, KAIST, Korea {woojun.kim, ms.cho, ycsung}@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1 DCC with Message-Dropout (DCC-MD) (a hedged sketch of the message-dropout step appears after the table) |
| Open Source Code | No | The paper does not provide any concrete access to source code, such as a repository link or an explicit statement about code release in supplementary materials. |
| Open Datasets | Yes | The pursuit game is a standard task for multiagent systems (Vidal et al. 2002). The environment is made up of a two-dimensional grid and consists of N pursuers and M evaders. The goal of the game is to capture all evaders as fast as possible by training the agents (i.e., pursuers). Initially, all the evaders are at the center of the two-dimensional grid, and each evader randomly and independently chooses one of five actions at each time step: move North, East, West, South, or Stay. (Each evader stays if there exists a pursuer or a map boundary at the position where it is going to move.) Each pursuer is initially located at a random position of the map and has five possible actions: move North, East, West, South, or Stay. When the four sides of an evader are surrounded by pursuers or map boundaries, the evader is removed and the pursuers who capture the evader receive R_+ reward. All pursuers receive -R_1 reward for each time step and -R_2 reward if the pursuer hits the map boundary (the latter negative reward is to promote exploration). An episode ends when all the evaders are captured or T time steps elapse. As in (Gupta, Egorov, and Kochenderfer 2017), each pursuer observes its surrounding, which consists of the map boundary, evader(s), or other pursuer(s). We assume that each pursuer can observe up to D distances in four directions. Then, the observed information of each pursuer can be represented by a 3 x (2D + 1) x (2D + 1) observation window (which is the observation of each agent): a (2D + 1) x (2D + 1) window detecting other pursuer(s), a (2D + 1) x (2D + 1) window detecting evader(s), and a (2D + 1) x (2D + 1) window detecting the map boundary. For the game of pursuit, we set R_+ = 5, R_1 = 0.05, R_2 = 0.5, T = 500, M = 2, and D = 3 and simulate two cases: N = 6 and N = 8. The map sizes of the two cases are 15 x 15 and 17 x 17, respectively. (A code sketch of this observation window is given after the table.) |
| Dataset Splits | No | The paper refers to 'training phase' and 'execution phase' but does not explicitly mention a 'validation set' or define specific training/validation/test splits as commonly done in supervised learning contexts. Performance evaluation is described as 'after training'. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions general learning frameworks like DQN and MADDPG but does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | For the game of pursuit, we set R_+ = 5, R_1 = 0.05, R_2 = 0.5, T = 500, M = 2, and D = 3 and simulate two cases: N = 6 and N = 8. (These settings are gathered into a configuration sketch after the table.) |
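
The Algorithm 1 noted in the Pseudocode row applies dropout to the messages an agent's network receives from the other agents during training. Below is a minimal sketch of that blockwise message-dropout step, under the assumption that each received message block is dropped independently with probability p and the surviving blocks are rescaled by 1/(1-p) (inverted-dropout style), with no dropout at execution time. The function name and tensor shapes are illustrative, not taken from the authors' code.

```python
import torch


def message_dropout(own_input, messages, p=0.5, training=True):
    """own_input: (batch, d_own); messages: (batch, n_others, d_msg)."""
    if training and 0.0 < p < 1.0:
        batch, n_others, _ = messages.shape
        # One Bernoulli keep/drop decision per received message block,
        # i.e. the whole message from another agent is kept or zeroed.
        keep = (torch.rand(batch, n_others, 1, device=messages.device) > p).float()
        # Rescale surviving blocks so the expected input scale is unchanged,
        # as in standard (inverted) dropout.
        messages = messages * keep / (1.0 - p)
    # At execution time (training=False) no blocks are dropped and the full
    # concatenated input is used.
    return torch.cat([own_input, messages.flatten(start_dim=1)], dim=1)
```

For example, with N = 6 pursuers, each agent's network input would concatenate its own observation with the five (possibly dropped) message blocks received from the other pursuers.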
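
The Open Datasets row describes each pursuer's local view as a 3 x (2D + 1) x (2D + 1) binary window. The following sketch shows one way to build such a window; it is not the authors' environment code, and the channel assignment (channel 0 for other pursuers, channel 1 for evaders, channel 2 for out-of-map cells) simply follows the order in which the quote lists the three sub-windows.

```python
import numpy as np


def local_observation(agent_pos, pursuer_positions, evader_positions,
                      map_size, D=3):
    """Return a (3, 2D+1, 2D+1) window centred on agent_pos = (row, col)."""
    obs = np.zeros((3, 2 * D + 1, 2 * D + 1), dtype=np.float32)
    r0, c0 = agent_pos
    for dr in range(-D, D + 1):
        for dc in range(-D, D + 1):
            r, c = r0 + dr, c0 + dc
            i, j = dr + D, dc + D
            if not (0 <= r < map_size and 0 <= c < map_size):
                obs[2, i, j] = 1.0          # cell lies outside the map boundary
                continue
            if (r, c) in pursuer_positions and (r, c) != agent_pos:
                obs[0, i, j] = 1.0          # another pursuer
            if (r, c) in evader_positions:
                obs[1, i, j] = 1.0          # an evader
    return obs
```

With the reported D = 3, each observation is a 3 x 7 x 7 tensor.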
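
For convenience, the pursuit settings quoted in the Experiment Setup row can be collected into a single configuration. The field names below are hypothetical, and the sign convention (per-step and boundary penalties subtracted) follows the "latter negative reward" remark in the Open Datasets quote.

```python
# Hypothetical configuration mirroring the reported pursuit setup
# (R_+ = 5, R_1 = 0.05, R_2 = 0.5, T = 500, M = 2, D = 3; N = 6 on a 15x15 map
# or N = 8 on a 17x17 map). Names are illustrative, not from the paper's code.
PURSUIT_CONFIG = dict(
    capture_reward=5.0,      # R_+: shared reward for capturing an evader
    step_penalty=0.05,       # R_1: subtracted at every time step
    boundary_penalty=0.5,    # R_2: subtracted when a pursuer hits the boundary
    max_steps=500,           # T: episode length limit
    num_evaders=2,           # M
    obs_range=3,             # D: pursuers observe up to D cells in each direction
    cases=[dict(num_pursuers=6, map_size=15),
           dict(num_pursuers=8, map_size=17)],
)
```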