Learning to Schedule Communication in Multi-agent Reinforcement Learning

Authors: Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyunghwan Son, Yung Yi

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SchedNet against multiple baselines in two different applications, namely cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap, ranging from 32% to 43%, between SchedNet and other mechanisms, such as those without communication or with vanilla scheduling methods (e.g., round robin).
Researcher Affiliation | Academia | Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyunghwan Son & Yung Yi, School of Electrical Engineering, KAIST, Daejeon, South Korea
Pseudocode | Yes | Algorithm 1 (SchedNet) is given in pseudocode; a hedged sketch of the weight-based Top(k) scheduling step it relies on appears after this table.
Open Source Code | Yes | The code is available at https://github.com/rhoowd/sched_net
Open Datasets | Yes | Environments: To evaluate SchedNet, we consider two different environments for demonstrative purposes: Predator and Prey (PP), which is used in Stone & Veloso (2000), and Cooperative Communication and Navigation (CCN), which is a simplified version of the one in Lowe et al. (2017).
Dataset Splits | No | The paper describes simulated environments and training steps but does not provide explicit training/validation/test splits, since the experiments rely on simulated interaction rather than a static dataset.
Hardware Specification | No | The paper does not describe the hardware (e.g., GPU models, CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software components, libraries, or frameworks.
Experiment Setup | Yes | Table 1 shows the hyperparameter values for the CCN and PP tasks:

Hyperparameter | Value | Description
training step | 750000 | Maximum time steps until the end of training
episode length | 1000 | Maximum time steps per episode
discount factor | 0.9 | Importance of future rewards
learning rate for actor | 0.00001 | Actor network learning rate used by the Adam optimizer
learning rate for critic | 0.0001 | Critic network learning rate used by the Adam optimizer
target update rate | 0.05 | Target network update rate to track the learned network
entropy regularization weight | 0.01 | Weight of regularization to encourage exploration
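
To make the Table 1 values concrete, here is a minimal sketch of how they could be gathered into a training configuration, together with the soft target-network update that a target update rate of 0.05 typically refers to. This is an illustration only, not the authors' code; the dictionary keys, the `soft_update` helper, and the exact update rule are assumptions.

```python
# Hyperparameters from Table 1, gathered into one configuration dictionary.
# The key names are illustrative; they do not come from the SchedNet repository.
CONFIG = {
    "training_steps": 750_000,   # maximum time steps until the end of training
    "episode_length": 1_000,     # maximum time steps per episode
    "gamma": 0.9,                # discount factor (importance of future rewards)
    "lr_actor": 1e-5,            # actor learning rate for the Adam optimizer
    "lr_critic": 1e-4,           # critic learning rate for the Adam optimizer
    "tau": 0.05,                 # target network update rate
    "entropy_coef": 0.01,        # entropy regularization weight
}

def soft_update(online_params, target_params, tau=CONFIG["tau"]):
    """Common soft target-network update (assumed form, not verified against
    the authors' code): theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]

# Example: with tau = 0.05 the target parameter moves 5% of the way
# toward the learned parameter at each update.
print(soft_update([1.0], [0.0]))  # -> [0.05]
```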
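
As noted in the Pseudocode row above, Algorithm 1 relies on a weight-based scheduler that allows only a limited number of agents to broadcast at each step. The snippet below is a minimal sketch of the Top(k) selection idea as described in the paper; the function name and interface are assumptions, not the authors' implementation.

```python
import numpy as np

def top_k_schedule(weights, k):
    """Return a binary schedule vector c with c[i] = 1 for the k agents
    whose scheduling weights are largest (they are allowed to broadcast)."""
    weights = np.asarray(weights, dtype=float)
    chosen = np.argsort(weights)[-k:]            # indices of the k largest weights
    schedule = np.zeros(len(weights), dtype=int)
    schedule[chosen] = 1
    return schedule

# Example: 4 agents, bandwidth allows 2 simultaneous messages.
print(top_k_schedule([0.1, 0.7, 0.3, 0.9], k=2))  # -> [0 1 0 1]
```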