Robust Multi-Agent Reinforcement Learning with Model Uncertainty
Authors: Kaiqing Zhang, Tao Sun, Yunzhe Tao, Sahika Genc, Sunil Mallya, Tamer Başar
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the proposed algorithm outperforms several baseline MARL methods that do not account for the model uncertainty, in several standard but uncertain cooperative and competitive MARL environments. |
| Researcher Affiliation | Collaboration | Kaiqing Zhang, Tao Sun, Yunzhe Tao, Sahika Genc, Sunil Mallya, Tamer Başar. Department of ECE and CSL, University of Illinois at Urbana-Champaign; Amazon Web Services. {kzhang66, basar1}@illinois.edu, {suntao, yunzhet, sahika, smallya}@amazon.com |
| Pseudocode | Yes | See Algorithm 1 in Supplementary C for the pseudo-code of our actor-critic-based robust MARL algorithm. |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | based on the multi-agent particle environments developed in [13]. |
| Dataset Splits | No | The paper does not specify exact percentages or sample counts for training, validation, or test splits. It references environments from a prior work but does not detail the data partitioning within this paper. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of the multi-agent particle environments from [13], but it does not list the software libraries, frameworks, or version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | In order to test the robustness of the proposed algorithm, which is referred to as Robust-MADDPG, or R-MADDPG for brevity, we impose different levels of uncertainty on the rewards returned from each particle environment. In particular, we use truncated Gaussian noise, defined as R̃(s, a) ∼ N_trunc(R(s, a), λ), to ensure the compactness of the uncertainty set. The parameter λ controls the uncertainty level of the rewards and R(s, a) is the true reward. ... We report statistics that are averaged across 5 runs for cooperative navigation, and 25 runs for other scenarios where each agent or adversary is trained five times. |
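
The reward-uncertainty injection quoted in the Experiment Setup row can be illustrated with a short sketch. The snippet below is not the authors' implementation: the function name `perturb_reward` and the truncation width `num_std` are assumptions, since the paper only states that a truncated Gaussian centred at the true reward with uncertainty level λ is used.

```python
import numpy as np
from scipy.stats import truncnorm


def perturb_reward(true_reward: float, lam: float,
                   num_std: float = 2.0, rng=None) -> float:
    """Sample a noisy reward from a truncated Gaussian centred at the true
    reward, mirroring R_tilde(s, a) ~ N_trunc(R(s, a), lam) from the quote.

    `lam` is the uncertainty level; `num_std` bounds the support to
    [R - num_std * lam, R + num_std * lam] so the uncertainty set stays
    compact (the exact truncation width is not reported in the paper).
    """
    if lam == 0.0:
        return true_reward  # no uncertainty: return the true reward
    # truncnorm takes its bounds in standard deviations of the base Gaussian.
    return float(truncnorm.rvs(-num_std, num_std, loc=true_reward,
                               scale=lam, random_state=rng))


# Example: perturb an environment reward before handing it to the learner.
rng = np.random.default_rng(0)
noisy_r = perturb_reward(true_reward=1.0, lam=0.5, rng=rng)
```

Sweeping `lam` over several values would reproduce the "different levels of uncertainty" described in the setup, with `lam = 0` recovering the noise-free environment.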