Learning to Communicate with Deep Multi-Agent Reinforcement Learning
Authors: Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, Shimon Whiteson
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments introduce new environments for studying the learning of communication protocols and present a set of engineering innovations that are essential for success in these domains. Experiments on two benchmark tasks, based on the MNIST dataset and a well known riddle, show, not only can these methods solve these tasks, they often discover elegant communication protocols along the way. |
| Researcher Affiliation | Collaboration | (1) University of Oxford, United Kingdom; (2) Canadian Institute for Advanced Research, CIFAR NCAP Program; (3) Google DeepMind |
| Pseudocode | Yes | Further algorithmic details and pseudocode are in the supplementary material. |
| Open Source Code | Yes | Source code is available at: https://github.com/iassael/learning-to-communicate |
| Open Datasets | Yes | Experiments on two benchmark tasks, based on the MNIST dataset and a well known riddle... MNIST digit classification dataset [25]. |
| Dataset Splits | No | The paper describes training RL agents within environments and evaluating their performance. It does not provide explicit training/validation/test dataset splits in the conventional sense for a static dataset like MNIST (e.g., '80% training, 10% validation, 10% test'). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper names algorithmic components such as RMSProp and GRUs, but it does not specify the software frameworks, libraries, or version numbers required for replication. |
| Experiment Setup | Yes | In our experiments, we use an ϵ-greedy policy with ϵ = 0.05, the discount factor is γ = 1, and the target network is reset every 100 episodes. To stabilise learning, we execute parallel episodes in batches of 32. The parameters are optimised using RMSProp [19] with a learning rate of 5 × 10⁻⁴. Unless stated otherwise, we set the standard deviation of noise added to the channel to σ = 2, which was found to be essential for good performance. |
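
The quoted experiment-setup row fixes every scalar hyperparameter a re-run would need. Below is a minimal Python/NumPy sketch that gathers those values and illustrates the two mechanisms they govern, ϵ-greedy action selection and Gaussian noise on the communication channel. The `CONFIG` dict and function names are illustrative assumptions of ours, not taken from the authors' released Torch code.

```python
import numpy as np

# Hypothetical sketch: hyperparameters copied from the quoted experiment setup.
CONFIG = {
    "epsilon": 0.05,               # epsilon-greedy exploration rate
    "gamma": 1.0,                  # discount factor
    "target_reset_episodes": 100,  # episodes between target-network resets
    "batch_size": 32,              # parallel episodes per batch
    "learning_rate": 5e-4,         # RMSProp learning rate
    "channel_noise_sigma": 2.0,    # std of noise added to the channel
}

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=CONFIG["epsilon"]):
    """Return a random action index with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def noisy_channel(message, sigma=CONFIG["channel_noise_sigma"]):
    """Add Gaussian noise (std sigma) to a real-valued message, mirroring the
    noise the paper injects into the communication channel during training."""
    return message + rng.normal(0.0, sigma, size=np.shape(message))

# Example usage: pick an action and perturb a one-element message vector.
action = epsilon_greedy(np.array([0.1, 0.4, 0.2]))
noisy_msg = noisy_channel(np.array([0.3]))
```

Note that the paper's channel unit also discretises messages at execution time; the sketch covers only the training-time noise term described in the quoted setup.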