Aligning Individual and Collective Objectives in Multi-Agent Cooperation

Authors: Yang Li, Wenhao Zhang, Jianhong Wang, Shao Zhang, Yali Du, Ying Wen, Wei Pan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of our algorithm AgA through benchmark environments for testing mixed-motive collaboration with small-scale agents, such as the two-player public goods game and the sequential social dilemma games Cleanup and Harvest, as well as our self-developed large-scale environment in the game StarCraft II.
Researcher Affiliation | Academia | Yang Li, The University of Manchester, yang.li-4@manchester.ac.uk; Wenhao Zhang, Shanghai Jiao Tong University, wenhao_zhang@sjtu.edu.cn; Jianhong Wang, INFORMED-AI Hub, University of Bristol, jianhong.wang@bristol.ac.uk; Shao Zhang, Shanghai Jiao Tong University, shaozhang@sjtu.edu.cn; Yali Du, King's College London, yali.du@kcl.ac.uk; Ying Wen, Shanghai Jiao Tong University, ying.wen@sjtu.edu.cn; Wei Pan, The University of Manchester, wei.pan@manchester.ac.uk
Pseudocode | Yes | Algorithm 1: Altruistic Gradient Adjustment (AgA) (an illustrative sketch follows the table)
Open Source Code | No | We provide the details to reproduce the main experimental results in Appendix E and the main code in supplemental material (we will release them when the paper is accepted).
Open Datasets | Yes | In addition to commonly used testbeds like the public goods matrix game and sequential social dilemma games (Cleanup and Harvest) [Leibo et al., 2017]... we introduce a more complex mixed-motive environment called Selfish-MMM2, an adaptation of the MMM2 map from the StarCraft II game [Samvelyan et al., 2019].
Dataset Splits | No | The paper describes the use of simulation environments (Cleanup, Harvest, Selfish-MMM2) for training and evaluation. However, it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as these environments are run for a number of steps or episodes rather than being static datasets that are split.
Hardware Specification | Yes | Most experiments were conducted on a node with a Tesla V100 GPU (32GB memory) and 40 CPU cores. ... Most experiments were conducted on a node with two NVIDIA GeForce RTX 3090 GPUs and 32 CPU cores.
Software Dependencies | No | The paper mentions using the 'PPO algorithm in stable-baselines3' and the 'Adam optimizer', along with the 'IPPO' and 'MAPPO' algorithms, but does not specify exact version numbers for these software packages or libraries (a version-logging sketch follows the table).
Experiment Setup | Yes | The hyper-parameters for PPO training are as follows: the learning rate is 1e-4, the PPO clipping factor is 0.2, the value loss coefficient is 1, the entropy coefficient is 0.001, γ is 0.99, the total environment steps are 1e7 for Harvest and 2e7 for Cleanup, the environment episode length is 1000, and the grad clip is 40. The hyper-parameters for PPO-based training are as follows: the learning rate is 5e-4, the PPO clipping factor is 0.2, the value loss coefficient is 1, the entropy coefficient is 0.01, γ is 0.99, the total environment steps are 1e7, the factor β in the reward function is 1, and the environment episode length is 400.
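
Algorithm 1 (AgA) is given as pseudocode in the paper. The snippet below is a minimal, hypothetical sketch of a gradient-adjustment step in that spirit: an agent's individual-objective gradient is blended with the gradient of a collective (altruistic) objective, with the sign of the collective term corrected when the two directions conflict. The blending rule and the weight lam are illustrative assumptions, not the exact update in Algorithm 1.

    import numpy as np

    def adjusted_gradient(g_individual, g_collective, lam=1.0):
        # Illustrative gradient adjustment (assumed form, not the paper's exact rule):
        # add the collective-objective gradient, flipping its sign when it conflicts
        # with the individual gradient so the added term never cancels the
        # individual update direction.
        alignment = float(np.dot(g_individual, g_collective))
        sign = 1.0 if alignment >= 0.0 else -1.0
        return g_individual + lam * sign * g_collective

    # Toy usage with two conflicting 2-D gradients.
    g_ind = np.array([1.0, 0.0])
    g_col = np.array([-1.0, 1.0])
    print(adjusted_gradient(g_ind, g_col, lam=0.5))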
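
Because exact package versions are not reported, reproducers may want to log the versions used in their own runs. The sketch below records installed versions via importlib.metadata; the package names listed are assumptions and should be adjusted to the actual dependency set.

    from importlib.metadata import PackageNotFoundError, version

    # Hypothetical dependency list; adjust to match the paper's actual setup.
    packages = ["stable-baselines3", "torch", "gymnasium", "numpy"]

    for name in packages:
        try:
            print(f"{name}=={version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")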
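
The reported hyper-parameters map directly onto the PPO constructor arguments in stable-baselines3. The sketch below instantiates PPO with the Cleanup/Harvest settings quoted above, using a standard Gymnasium environment as a stand-in because the paper's environments are not packaged here; it is a configuration illustration, not the authors' training script.

    import gymnasium as gym
    from stable_baselines3 import PPO

    # Stand-in environment; the paper trains on Cleanup/Harvest, not CartPole.
    env = gym.make("CartPole-v1")

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=1e-4,   # learning rate
        clip_range=0.2,       # PPO clipping factor
        vf_coef=1.0,          # value loss coefficient
        ent_coef=0.001,       # entropy coefficient
        gamma=0.99,           # discount factor γ
        max_grad_norm=40.0,   # grad clip
        verbose=1,
    )

    # Reported budget: 1e7 environment steps for Harvest, 2e7 for Cleanup
    # (shortened here to keep the example quick).
    model.learn(total_timesteps=10_000)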