Learning Expensive Coordination: An Event-Based Deep RL Approach

Authors: Zhenyu Shi*, Runsheng Yu*, Xinrun Wang*, Rundong Wang, Youzhi Zhang, Hanjiang Lai, Bo An

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically." (Section 5: Experimental Results)
Researcher Affiliation | Academia | School of Computer Science and Engineering, Nanyang Technological University, Singapore (runshengyu@gmail.com, {xwang033,rundong001,yzhang137}@e.ntu.edu.sg, boan@ntu.edu.sg); Zhenyu Shi & Hanjiang Lai, School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China (shizhy6@mail2.sysu.edu.cn, laihanj3@mail.sysu.edu.cn)
Pseudocode | Yes | "The pseudo-code can be found in Appendix C." (Algorithm 1: EBPG; Algorithm 2: Action Choices for Leader)
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper mentions tasks such as 'Resource Collections', 'Modified Navigation', and 'Modified Predator-Prey', adapted from prior work, but does not provide links, DOIs, repositories, or formal citations (with authors and years) for public access to the modified datasets or environments used in the experiments.
Dataset Splits | No | The paper states "The total training episode is 250,000 for all the tasks" but does not specify exact training, validation, and test splits (e.g., percentages, sample counts, or references to predefined splits).
Hardware Specification | Yes | "Our method takes less than two days to train on an NVIDIA GeForce GTX 1080Ti GPU in each experiment."
Software Dependencies | No | "Our code is implemented in Pytorch (Paszke et al., 2017). The optimization algorithm is Adam (Kingma & Ba, 2014)." (No specific version number for PyTorch is provided.)
Experiment Setup | Yes | "If no special mention, the batch size is 1 (online learning). Similar to (Shu & Tian, 2019), we set the learning rate to 0.001 for the leader's critic and the followers, and 0.0003 for the leader's policy. The optimization algorithm is Adam (Kingma & Ba, 2014). For the loss function, we set λ1 = 0.01 and λ2 = 0.001. The total training episode is 250,000 for all the tasks (including both the rule-based followers and the RL-based followers). To encourage exploration, we use ε-greedy. For the leader, the exploration rate is set to 0.1 and decreases gradually to zero over 5,000 episodes. For the followers, the exploration rate for each agent is always 0.3 (except for the noise experiments)."
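For concreteness, the sketch below shows how the quoted training configuration could be wired up in PyTorch. It is a minimal illustration, not the authors' code: the network shapes, the module names (leader_policy, leader_critic, follower_policies), and the linear decay schedule are assumptions; only the optimizer choice, learning rates, loss weights, episode count, and exploration rates come from the quoted text.

```python
# Hypothetical sketch of the reported training configuration (not the authors' code).
# Values taken from the paper's setup: Adam; lr 0.001 for the leader's critic and the
# followers; lr 0.0003 for the leader's policy; lambda1 = 0.01, lambda2 = 0.001;
# 250,000 training episodes; leader epsilon 0.1 -> 0 over 5,000 episodes; follower
# epsilon fixed at 0.3. Everything else (architectures, decay shape) is assumed.
import torch
import torch.nn as nn

# Placeholder networks; the actual architectures are described in the paper's appendix.
leader_policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
leader_critic = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
follower_policies = [
    nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4)) for _ in range(3)
]

# Adam optimizers with the learning rates quoted above.
opt_leader_policy = torch.optim.Adam(leader_policy.parameters(), lr=3e-4)
opt_leader_critic = torch.optim.Adam(leader_critic.parameters(), lr=1e-3)
opt_followers = [torch.optim.Adam(f.parameters(), lr=1e-3) for f in follower_policies]

# Loss weights for the two regularization terms (the exact loss form is in the paper).
LAMBDA_1, LAMBDA_2 = 0.01, 0.001

TOTAL_EPISODES = 250_000
FOLLOWER_EPSILON = 0.3  # fixed, except for the noise experiments


def leader_epsilon(episode: int) -> float:
    """Leader exploration rate: 0.1 decaying to zero over the first 5,000 episodes.

    The paper only says the rate "decreases gradually to zero"; a linear schedule
    is an assumption made here for illustration.
    """
    return max(0.0, 0.1 * (1.0 - episode / 5_000))
```

With batch size 1 (online learning), each episode would contribute a single update per optimizer; the schedule above is queried once per episode to set the leader's exploration rate.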