Conservative Offline Policy Adaptation in Multi-Agent Games

Authors: Chengjie Wu, Pingzhong Tang, Jun Yang, Yujing Hu, Tangjie Lv, Changjie Fan, Chongjie Zhang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.
Researcher Affiliation | Collaboration | Chengjie Wu (1), Pingzhong Tang (1, 2), Jun Yang (3), Yujing Hu (4), Tangjie Lv (4), Changjie Fan (4), Chongjie Zhang (5). (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) Turingsense; (3) Department of Automation, Tsinghua University; (4) Fuxi AI Lab, NetEase; (5) Department of Computer Science & Engineering, Washington University in St. Louis.
Pseudocode | Yes | Algorithm 1: Constrained Self-Play (CSP)
Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology.
Open Datasets | Yes | Empirically, we evaluate the effectiveness of our algorithm in four environments: a didactic maze environment, predator-prey in MPE [39], a competitive two-agent MuJoCo environment [2, 10] requiring continuous control, and Google Football [21].
Dataset Splits | No | The paper discusses training and testing performance but does not specify validation dataset splits (e.g., percentages or sample counts for a validation set).
Hardware Specification | Yes | Each seed is run on a GPU server with one NVIDIA P100 GPU and an Intel(R) Xeon(R) Gold 6145 CPU @ 2.00GHz.
Software Dependencies | No | The paper mentions using MAPPO [48] as the base RL algorithm but does not specify versions for programming languages or libraries (e.g., Python 3.x, PyTorch x.x).
Experiment Setup | Yes | The hyper-parameters are listed in Tables 6, 7, and 8. Table 5 (hyper-parameters for maze): ppo_epoch 1; num_mini_batch 1; entropy_coef 0.3; use_gae True; gamma 0.99999; gae_lambda 0.95; critic_lr 7e-4; lr 7e-4; weight_decay 0; adam_eps 1e-5; n_rollout_threads 20; ppo_episode_length 12; data_chunk_length 12; steps 1.8K; max_grad_norm 0.5; bc_regularization_coef 10; bc_batch_size 8; network MLP.
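
For anyone re-running the maze experiment, the Table 5 values quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not the authors' code (none was released): the class name MazeHyperParams, the constant MAZE_CONFIG, and the check_ranges helper are hypothetical, and only the field names and values come from the table. The steps value is kept as the reported string "1.8K" because its unit is not expanded in the quoted text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MazeHyperParams:
    """Hypothetical container; field names and values are copied from Table 5 (maze)."""
    ppo_epoch: int = 1
    num_mini_batch: int = 1
    entropy_coef: float = 0.3
    use_gae: bool = True
    gamma: float = 0.99999
    gae_lambda: float = 0.95
    critic_lr: float = 7e-4
    lr: float = 7e-4
    weight_decay: float = 0.0
    adam_eps: float = 1e-5
    n_rollout_threads: int = 20
    ppo_episode_length: int = 12
    data_chunk_length: int = 12
    steps: str = "1.8K"  # kept verbatim as reported; not converted to a number
    max_grad_norm: float = 0.5
    bc_regularization_coef: float = 10.0
    bc_batch_size: int = 8
    network: str = "MLP"


def check_ranges(cfg: MazeHyperParams) -> bool:
    """Basic sanity checks (an assumption of this sketch, not taken from the paper)."""
    return (
        0.0 < cfg.gamma <= 1.0
        and 0.0 <= cfg.gae_lambda <= 1.0
        and cfg.lr > 0.0
        and cfg.critic_lr > 0.0
        and cfg.max_grad_norm > 0.0
    )


MAZE_CONFIG = MazeHyperParams()

if __name__ == "__main__":
    # Print the settings one per line, e.g. for logging alongside a run.
    for name, value in asdict(MAZE_CONFIG).items():
        print(f"{name} = {value}")
```

Freezing the dataclass keeps the reported settings from being modified by accident during a run; a variant for another environment would need the values from Tables 6, 7, or 8, which are not reproduced here.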