Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization
Authors: Xiangsen Wang, Haoran Xu, Yinan Zheng, Xianyuan Zhan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks. We evaluate our method using various types of offline datasets on both multi-agent MuJoCo [15] and StarCraft Multi-Agent Challenge (SMAC) tasks [16]. Under all settings, OMIGA achieves better performance and enjoys faster convergence compared with other strong baselines. |
| Researcher Affiliation | Academia | Xiangsen Wang (Beijing Jiaotong University), Haoran Xu (UT Austin), Yinan Zheng (Tsinghua University), Xianyuan Zhan (Tsinghua University; Shanghai Artificial Intelligence Laboratory) |
| Pseudocode | Yes | Algorithm 1 Pseudocode of OMIGA |
| Open Source Code | Yes | Our code is available at https://github.com/ZhengYinan-AIR/OMIGA. |
| Open Datasets | Yes | We choose multi-agent MuJoCo [15] and StarCraft Multi-Agent Challenge (SMAC) [16] as our experiment environments. The offline dataset we used is provided by Meng et al. [38], which is collected from the online trained MAPPO agents [47], and is the largest open offline dataset on SMAC. |
| Dataset Splits | No | The paper describes offline datasets of different quality levels (expert, medium, medium-replay, medium-expert) but does not specify how these datasets are partitioned into explicit training, validation, and test splits for the experiments. |
| Hardware Specification | Yes | In this paper, all experiments are implemented with PyTorch and executed on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions that experiments are "implemented with PyTorch", but it does not specify the PyTorch version or any other software dependencies with their versions. |
| Experiment Setup | Yes | The local Q-value, state-value networks and policy networks of OMIGA are represented by 3-layer ReLU-activated MLPs with 256 units for each hidden layer. For the weight network, we use 2-layer ReLU-activated MLPs with 64 units for each hidden layer. All the networks are optimized by the Adam optimizer. Shared hyperparameters: Q-value network learning rate 5e-4; policy network learning rate 5e-4; optimizer Adam; target update rate 0.005; batch size 128; discount factor 0.99; hidden dimension 256; weight network hidden dimension 64. OMIGA-specific: state-value network learning rate 5e-4; regularization parameter α of 1 or 10. |
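To make the reported experiment setup concrete, below is a minimal PyTorch sketch of the network sizes and optimizer settings quoted above. It is not the authors' implementation: the class/variable names, the placeholder input dimensions, the choice to feed observation–action pairs to the local Q-network, and the reading of "3-layer" as three hidden layers are all assumptions for illustration.

```python
# Hedged sketch of the OMIGA network/optimizer setup described in the paper.
# Names and dimensions are illustrative assumptions, not code from the OMIGA repo.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden_dim, n_hidden_layers):
    """ReLU-activated MLP with the given number of hidden layers."""
    layers, dim = [], in_dim
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)


obs_dim, n_actions = 17, 5  # placeholder dimensions, task-dependent

# Local Q-value, state-value, and policy networks: ReLU MLPs with 256-unit hidden layers
# (assuming "3-layer" means three hidden layers, and that the local Q-network
# conditions on the observation and action).
q_net = mlp(obs_dim + n_actions, 1, hidden_dim=256, n_hidden_layers=3)
v_net = mlp(obs_dim, 1, hidden_dim=256, n_hidden_layers=3)
policy_net = mlp(obs_dim, n_actions, hidden_dim=256, n_hidden_layers=3)

# Weight network: 2-layer ReLU MLP with 64-unit hidden layers.
weight_net = mlp(obs_dim, 1, hidden_dim=64, n_hidden_layers=2)

# Adam optimizers with the reported learning rates (all 5e-4).
q_opt = torch.optim.Adam(q_net.parameters(), lr=5e-4)
v_opt = torch.optim.Adam(v_net.parameters(), lr=5e-4)
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=5e-4)

# Remaining reported settings: target update rate (tau), batch size,
# discount factor, and regularization parameter alpha (1 or 10 per task).
tau, batch_size, gamma, alpha = 0.005, 128, 0.99, 10
```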