Order Matters: Agent-by-agent Policy Optimization
Authors: Xihuai Wang, Zheng Tian, Ziyu Wan, Ying Wen, Jun Wang, Weinan Zhang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraft II, Multi-Agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full-game scenarios. A2PO consistently outperforms strong baselines. |
| Researcher Affiliation | Collaboration | Xihuai Wang (1,2), Zheng Tian (3), Ziyu Wan (1,2), Ying Wen (1), Jun Wang (2,4), Weinan Zhang (1); 1 Shanghai Jiao Tong University, 2 Digital Brain Lab, 3 ShanghaiTech University, 4 University College London |
| Pseudocode | Yes | Algorithm 1: Agent-by-agent Policy Optimization (A2PO) and Algorithm 2: Agent-by-agent Policy Optimization (Parameter Sharing) |
| Open Source Code | Yes | The source code of this paper is available at https://anonymous.4open.science/r/A2PO. |
| Open Datasets | Yes | StarCraft II Multi-agent Challenge (SMAC) (Samvelyan et al., 2019), Multi-Agent MuJoCo (MA-MuJoCo) (de Witt et al., 2020), Multi-agent Particle Environment (MPE) (Lowe et al., 2017), and more challenging Google Research Football full-game scenarios (Kurach et al., 2020). |
| Dataset Splits | No | The paper evaluates reinforcement learning agents in simulated environments (StarCraft II, MuJoCo, MPE, GRF). These environments involve continuous interaction and episodic learning, not static datasets with pre-defined train/validation/test splits. Performance metrics are gathered from agent-environment interactions rather than partitioned datasets. |
| Hardware Specification | No | No explicit hardware specifications (e.g., specific GPU/CPU models, memory details) are mentioned for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | We tune several hyper-parameters in all the benchmarks; other hyper-parameters follow the settings used in MAPPO. cϵ is selected to be 0.5 in all the tasks. (B.4 Hyper-parameters) Tables 7 through 11 detail the specific hyper-parameter values for each task (e.g., ppo epoch, actor lr, critic lr, λ, ϵ). |
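
For orientation, the sketch below illustrates the general shape of a sequential, agent-by-agent policy update of the kind Algorithm 1 describes, assuming PyTorch and hypothetical `agents` and `batch` containers. It is not the authors' implementation: it only applies a PPO-style clipped surrogate update to one agent at a time in a chosen order, and omits the paper's treatment of how later agents' objectives account for the preceding agents' already-updated policies.

```python
import torch

def sequential_agent_update(agents, batch, clip_eps=0.2, ppo_epochs=5):
    """Minimal sketch of an agent-by-agent (sequential) policy update.

    Assumptions (not from the paper): each element of `agents` exposes a
    `policy` module returning a torch.distributions object and an
    `optimizer`; `batch[i]` holds per-agent tensors "obs", "act", "adv",
    and "old_logp" collected before any update this round.
    """
    order = torch.randperm(len(agents)).tolist()  # one possible update order
    for i in order:
        agent = agents[i]
        obs, act, adv, old_logp = (batch[i][k] for k in ("obs", "act", "adv", "old_logp"))
        for _ in range(ppo_epochs):
            dist = agent.policy(obs)  # current policy distribution for this agent
            ratio = torch.exp(dist.log_prob(act) - old_logp)
            # PPO-style clipped surrogate objective, applied to one agent at a time.
            surrogate = torch.min(
                ratio * adv,
                torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
            )
            loss = -surrogate.mean()
            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()
        # In the actual A2PO procedure, agents updated later in the order take the
        # already-updated predecessors into account; that correction is omitted here.
```

The per-agent clip parameter and ppo-epoch count in this sketch correspond to the ϵ and ppo epoch entries listed in the hyper-parameter tables quoted above; the update order is randomized here purely for illustration.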