Aligning Individual and Collective Objectives in Multi-Agent Cooperation
Authors: Yang Li, Wenhao Zhang, Jianhong Wang, Shao Zhang, Yali Du, Ying Wen, Wei Pan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of our algorithm AgA through benchmark environments for testing mixed-motive collaboration with small-scale agents such as the two-player public good game and the sequential social dilemma games, Cleanup and Harvest, as well as our self-developed large-scale environment in the game StarCraft II. |
| Researcher Affiliation | Academia | Yang Li, The University of Manchester, yang.li-4@manchester.ac.uk; Wenhao Zhang, Shanghai Jiao Tong University, wenhao_zhang@sjtu.edu.cn; Jianhong Wang, INFORMED-AI Hub, University of Bristol, jianhong.wang@bristol.ac.uk; Shao Zhang, Shanghai Jiao Tong University, shaozhang@sjtu.edu.cn; Yali Du, King's College London, yali.du@kcl.ac.uk; Ying Wen, Shanghai Jiao Tong University, ying.wen@sjtu.edu.cn; Wei Pan, The University of Manchester, wei.pan@manchester.ac.uk |
| Pseudocode | Yes | Algorithm 1 Altruistic Gradient Adjustment (AgA) |
| Open Source Code | No | We provide the details to reproduce the main experimental results in Appendix E and the main code in supplemental material (we will release them when the paper is accepted). |
| Open Datasets | Yes | In addition to commonly used testbeds like the public goods matrix game and sequential social dilemma games (Cleanup and Harvest) [Leibo et al., 2017]... we introduce a more complex mixed-motive environment called Selfish MMM2, an adaptation of the MMM2 map from the StarCraft II game [Samvelyan et al., 2019]. |
| Dataset Splits | No | The paper describes the use of simulation environments (Cleanup, Harvest, Selfish-MMM2) for training and evaluation. However, it does not specify explicit training, validation, and test *dataset splits* with percentages or sample counts for these environments, as they are typically run for a number of steps/episodes rather than being static datasets that are split. |
| Hardware Specification | Yes | Most experiments were conducted on a node with a Tesla V100 GPU (32GB memory) and 40 CPU cores. Most experiments were conducted on a node with two NVIDIA GeForce RTX 3090 GPUs and 32 CPU cores. |
| Software Dependencies | No | The paper mentions using 'PPO algorithm in stable-baselines3' and 'Adam optimizer', along with 'IPPO' and 'MAPPO' algorithms, but does not specify exact version numbers for these software packages or libraries. |
| Experiment Setup | Yes | The hyper-parameters for PPO training are as follows: the learning rate is 1e-4, the PPO clipping factor is 0.2, the value loss coefficient is 1, the entropy coefficient is 0.001, γ is 0.99, the total environment steps are 1e7 for Harvest and 2e7 for Cleanup, the environment episode length is 1000, and the grad clip is 40. The hyper-parameters for PPO-based training are as follows: the learning rate is 5e-4, the PPO clipping factor is 0.2, the value loss coefficient is 1, the entropy coefficient is 0.01, γ is 0.99, the total environment steps are 1e7, the factor β in the reward function is 1, and the environment episode length is 400. (A hedged configuration sketch mapping these values onto stable-baselines3 follows the table.) |
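
The Experiment Setup row quotes raw PPO hyper-parameters; the sketch below shows one way they could be plugged into stable-baselines3, the library named in the Software Dependencies row. This is a minimal illustration assuming the Harvest/Cleanup settings, with `CartPole-v1` as a stand-in environment, since the paper's environment wrappers are not part of the released material and the exact argument mapping is our assumption rather than the authors' code.

```python
# Minimal sketch: the quoted Harvest/Cleanup PPO hyper-parameters expressed as
# stable-baselines3 PPO arguments. "CartPole-v1" is only a stand-in for the
# paper's (unreleased) environment wrappers.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # placeholder environment

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=1e-4,   # quoted learning rate (Harvest/Cleanup)
    clip_range=0.2,       # PPO clipping factor
    vf_coef=1.0,          # value loss coefficient
    ent_coef=0.001,       # entropy coefficient
    gamma=0.99,           # discount factor γ
    max_grad_norm=40.0,   # grad clip
)

# Total environment steps: 1e7 for Harvest, 2e7 for Cleanup, per the quote.
model.learn(total_timesteps=int(1e7))
```

Note that stable-baselines3's PPO uses the Adam optimizer by default, which is consistent with the optimizer named in the Software Dependencies row; no optimizer argument needs to be set explicitly in this sketch.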