Correcting experience replay for multi-agent communication

Authors: Sanjeevan Ahilan, Peter Dayan

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments in the multi-agent particle environment (MPE), a world with a continuous observation and discrete action space, along with some basic simulated physics. For common problems in the MPE, immediate observations summarise relevant history, such as velocity, such that optimal policies can be learned using feedforward networks, which we use for both policy and critic. We provide precise details on implementation and hyperparameters in Appendix A.1.
Researcher Affiliation | Academia | Sanjeevan Ahilan, Gatsby Computational Neuroscience Unit, University College London (ahilan@gatsby.ucl.ac.uk); Peter Dayan, Max Planck Institute for Biological Cybernetics and University of Tübingen (dayan@tue.mpg.de)
Pseudocode | Yes | Algorithm 1: Ordered Communication Correction
Open Source Code | No | The paper references third-party implementations of MADDPG and MAAC, but does not explicitly state that the authors' own code for their communication correction method is open-source or provide a link to it.
Open Datasets | Yes | We conduct experiments in the multi-agent particle environment (MPE), a world with a continuous observation and discrete action space, along with some basic simulated physics. (https://github.com/openai/multiagent-particle-envs)
Dataset Splits | No | The paper describes training with a replay buffer and random seeds, but does not specify explicit train/validation/test dataset splits as would be typical for fixed datasets.
Hardware Specification | No | The paper does not specify any hardware details such as GPU or CPU models, or specific cloud computing resources used for running the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and the Gumbel Softmax estimator, and refers to existing implementations of MADDPG and MAAC, but does not provide specific version numbers for software dependencies like Python or deep learning frameworks.
Experiment Setup | Yes | For all algorithms and experiments, we used the Adam optimizer with a learning rate of 0.001 and τ = 0.01 for updating the target networks. The size of the replay buffer was 10^7 and we updated the network parameters after every 100 samples added to the replay buffer. We used a batch size of 1024 episodes before making an update. We trained with 20 random seeds for all experiments and show using a shaded region the standard error in the mean. For MADDPG and all MADDPG variants, hyperparameters were optimised using a line search centred on the experimental parameters used in Lowe et al. (2017) but with 64 neurons per layer in each feedforward network (each with two hidden layers). We found a value of γ = 0.75 worked best on Cooperative Communication with 6 landmarks evaluated after 50,000 episodes. We use the Straight-Through Gumbel Softmax estimator with an inverse temperature parameter of 1 to generate discrete actions (see A.2).
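
The "Research Type" and "Open Datasets" rows above point to the multi-agent particle environment linked at https://github.com/openai/multiagent-particle-envs. As a hedged illustration of how that environment is typically instantiated, the following minimal sketch assumes the repository's `make_env.py` is importable from the Python path and uses the stock `simple_speaker_listener` (Cooperative Communication) scenario; the paper's 6-landmark variant is a modification not shown here.

```python
# Minimal sketch: load a stock MPE scenario from the repository linked above.
# Assumption: the openai/multiagent-particle-envs repo root is on sys.path.
from make_env import make_env

env = make_env("simple_speaker_listener")  # stock Cooperative Communication scenario
obs_n = env.reset()                        # one observation vector per agent

print(env.n)                  # number of agents in the scenario
print(env.observation_space)  # list of continuous (Box) observation spaces
print(env.action_space)       # list of discrete action spaces
```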
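The "Pseudocode" row identifies Algorithm 1 (Ordered Communication Correction), which is given in the paper but not reproduced here. The sketch below is only an illustration of the underlying idea as described in the paper's abstract-level summary: when a transition is sampled from the replay buffer, stale messages are relabelled using the senders' current communication policies, with senders processed in order so that downstream agents condition on already-corrected messages. The communication chain, the `comm_policy` stand-in, and the batch layout are hypothetical and do not reproduce the paper's exact Algorithm 1.

```python
# Illustrative sketch of ordered relabelling of stale messages in a replayed
# transition. All shapes, the linear "policies", and the agent ordering are
# placeholders, not the paper's implementation.
import numpy as np

n_agents = 3
obs_dim, msg_dim = 4, 2
rng = np.random.default_rng(0)

# Hypothetical current communication policies: (private obs, incoming msg) -> message.
comm_weights = [rng.normal(size=(obs_dim + msg_dim, msg_dim)) for _ in range(n_agents)]

def comm_policy(i, private_obs, incoming_msg):
    """Current communication policy of agent i (a stand-in linear map)."""
    x = np.concatenate([private_obs, incoming_msg])
    return np.tanh(x @ comm_weights[i])

# A replayed transition: each agent's private observation plus the (stale)
# message it received when the experience was originally collected.
batch = {
    "private_obs": rng.normal(size=(n_agents, obs_dim)),
    "received_msg": rng.normal(size=(n_agents, msg_dim)),
}

def ordered_correction(batch, order=range(n_agents)):
    """Relabel messages sender-by-sender so that agents later in the order
    condition on already-corrected messages (here agent i sends to agent i+1)."""
    corrected = {"private_obs": batch["private_obs"].copy(),
                 "received_msg": batch["received_msg"].copy()}
    for i in order:
        new_msg = comm_policy(i, corrected["private_obs"][i],
                              corrected["received_msg"][i])
        if i + 1 < n_agents:  # deliver the corrected message downstream
            corrected["received_msg"][i + 1] = new_msg
    return corrected

relabelled = ordered_correction(batch)
print(relabelled["received_msg"])
```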
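The "Experiment Setup" row quotes the concrete hyperparameters: Adam with learning rate 0.001, soft target updates with τ = 0.01, a replay buffer of 10^7, batch size 1024, γ = 0.75, two hidden layers of 64 units, and a Straight-Through Gumbel-Softmax with temperature 1. A minimal PyTorch sketch of those mechanics follows; the observation and action dimensions and the single policy network are placeholders rather than the authors' implementation, and the replay buffer itself is not implemented.

```python
# Hedged sketch of the quoted training hyperparameters in PyTorch. Only the
# numbers (lr, tau, gamma, buffer size, batch size, hidden width, Gumbel-Softmax
# temperature) come from the paper; everything else is a stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

LR, TAU, GAMMA = 1e-3, 0.01, 0.75
BUFFER_SIZE, BATCH_SIZE = int(1e7), 1024  # replay capacity and batch of episodes

def mlp(in_dim, out_dim, hidden=64):
    # two hidden layers of 64 units, as described for the MADDPG variants
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

policy = mlp(in_dim=10, out_dim=5)          # obs -> action logits (dims hypothetical)
target_policy = mlp(in_dim=10, out_dim=5)
target_policy.load_state_dict(policy.state_dict())
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)

def soft_update(target, source, tau=TAU):
    """Polyak-average target parameters: theta_targ <- tau*theta + (1-tau)*theta_targ."""
    with torch.no_grad():
        for t_param, param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * param)

# Straight-Through Gumbel-Softmax: one-hot sample in the forward pass,
# differentiable softmax relaxation in the backward pass (temperature 1).
logits = policy(torch.randn(BATCH_SIZE, 10))
discrete_action = F.gumbel_softmax(logits, tau=1.0, hard=True)

soft_update(target_policy, policy)
```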