Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization

Authors: Xiangsen Wang, Haoran Xu, Yinan Zheng, Xianyuan Zhan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks. We evaluate our method using various types of offline datasets on both multi-agent MuJoCo [15] and StarCraft Multi-Agent Challenge (SMAC) tasks [16]. Under all settings, OMIGA achieves better performance and enjoys faster convergence compared with other strong baselines.
Researcher Affiliation | Academia | Xiangsen Wang¹, Haoran Xu², Yinan Zheng³, Xianyuan Zhan³,⁴; ¹Beijing Jiaotong University, ²UT Austin, ³Tsinghua University, ⁴Shanghai Artificial Intelligence Laboratory
Pseudocode | Yes | Algorithm 1: Pseudocode of OMIGA
Open Source Code | Yes | Our code is available at https://github.com/ZhengYinan-AIR/OMIGA.
Open Datasets | Yes | We choose multi-agent MuJoCo [15] and StarCraft Multi-Agent Challenge (SMAC) [16] as our experiment environments. The offline dataset we used is provided by Meng et al. [38], which is collected from the online trained MAPPO agents [47], and is the largest open offline dataset on SMAC.
Dataset Splits | No | The paper describes offline datasets of different qualities (expert, medium, medium-replay, medium-expert) but does not specify how these datasets are partitioned into explicit training, validation, and test splits for the experiments.
Hardware Specification | Yes | In this paper, all experiments are implemented with PyTorch and executed on NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions that experiments are "implemented with Pytorch", but it does not specify the version of PyTorch or of any other software dependencies.
Experiment Setup | Yes | The local Q-value, state-value, and policy networks of OMIGA are represented by 3-layer ReLU-activated MLPs with 256 units in each hidden layer. For the weight network, we use 2-layer ReLU-activated MLPs with 64 units in each hidden layer. All networks are optimized by the Adam optimizer. Hyperparameters (shared): Q-value network learning rate 5e-4; policy network learning rate 5e-4; optimizer Adam; target update rate 0.005; batch size 128; discount factor 0.99; hidden dimension 256; weight network hidden dimension 64. Hyperparameters (OMIGA): state-value network learning rate 5e-4; regularization parameter α = 1 or 10.
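To make the quoted setup concrete, the following is a minimal PyTorch sketch of the network sizes and optimizer settings listed above. The input/output dimensions, the number of agents, the weight-network learning rate, and the reading of "3-layer" as three hidden layers are illustrative assumptions, not details taken from the paper or its released code.

```python
# Minimal sketch of the reported network/optimizer configuration.
# Dimensions (obs_dim, action_dim, state_dim, n_agents) are placeholders,
# not values from the paper or its repository.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden, n_hidden_layers):
    """ReLU-activated MLP with n_hidden_layers hidden layers of `hidden` units."""
    layers, d = [], in_dim
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


obs_dim, action_dim, state_dim, n_agents = 17, 6, 48, 3  # placeholder sizes

# Local Q-value, state-value, and policy networks: 3-layer ReLU MLPs, 256 units.
q_net = mlp(obs_dim + action_dim, 1, hidden=256, n_hidden_layers=3)
v_net = mlp(obs_dim, 1, hidden=256, n_hidden_layers=3)
policy_net = mlp(obs_dim, action_dim, hidden=256, n_hidden_layers=3)

# Weight network: 2-layer ReLU MLP, 64 units. Mapping the global state to one
# mixing weight per agent is an assumption made only for this sketch.
weight_net = mlp(state_dim, n_agents, hidden=64, n_hidden_layers=2)

# Adam optimizers with the reported learning rate of 5e-4; the table does not
# list a weight-network learning rate, so matching it to 5e-4 is an assumption.
q_opt = torch.optim.Adam(q_net.parameters(), lr=5e-4)
v_opt = torch.optim.Adam(v_net.parameters(), lr=5e-4)
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=5e-4)
w_opt = torch.optim.Adam(weight_net.parameters(), lr=5e-4)

# Remaining shared settings from the table.
batch_size, gamma, tau, alpha = 128, 0.99, 0.005, 1.0  # alpha is 1 or 10
```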