Certifiably Robust Policy Learning against Adversarial Multi-Agent Communication
Authors: Yanchao Sun, Ruijie Zheng, Parisa Hassanzadeh, Yongyuan Liang, Soheil Feizi, Sumitra Ganesh, Furong Huang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in multiple environments verify that our defense significantly improves the robustness of trained policies against various types of attacks. In this section, we verify the robustness of our AME in multiple different CMARL environments against various communication attack algorithms. Then, we conduct hyperparameter tests for the ablation size k and the sample size D (for the variant of AME introduced in Section 4.3). |
| Researcher Affiliation | Collaboration | University of Maryland, College Park: {ycs, rzheng12, sfeizi, furongh}@umd.edu; JPMorgan AI Research: {parisa.hassanzadeh, sumitra.ganesh}@jpmchase.com; Shanghai AI Lab: cheryllLiang@outlook.com |
| Pseudocode | Yes | Algorithm 1 Training Phase of AME |
| Open Source Code | Yes | Our implementation of the AME algorithm and the Food Collector environment are available at https://github.com/umd-huang-lab/cmarl_ame.git. |
| Open Datasets | Yes | More specifically, we use N = 9 agents in the MNIST dataset of handwritten digits (LeCun et al., 1998). We use the same environment setup as the one in (Singh et al., 2019). |
| Dataset Splits | No | The paper states that the MNIST dataset consists of '60,000 training images and 10,000 testing images' but does not describe a validation split or explain how such a split would be created if one is used. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions software such as 'stable-baselines3 (Raffin et al., 2021)' but does not provide version numbers for its software dependencies (e.g., the PyTorch or stable-baselines3 versions used). |
| Experiment Setup | Yes | For the policy network, we use a multi-layer perceptron (MLP) with two hidden layers of size 64. ... we use a learning rate of 0.0003 for the policy network, and a learning rate of 0.001 is used for the value network. We use the Adam optimizer with β1 = 0.99 and β2 = 0.999. For every training epoch, the PPO agent interacts with the environment for 4000 steps, and it is trained for 500 epochs in our experiments. |
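
The quoted experiment setup can be approximated with stable-baselines3, which the paper cites. The sketch below is an illustrative configuration only, not the authors' code: `CartPole-v1` is a placeholder for the paper's CMARL environments, the AME defense is not reproduced, and because SB3's PPO shares a single Adam optimizer across the actor and critic, the quoted value-network learning rate of 0.001 cannot be set separately here; only the 0.0003 policy learning rate is applied.

```python
# Hedged sketch of the quoted PPO configuration using stable-baselines3.
# "CartPole-v1" is a placeholder environment, not the paper's CMARL setup.
import gymnasium as gym
import torch
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # placeholder single-agent environment

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # quoted policy-network learning rate
    n_steps=4000,        # 4000 environment steps per training epoch
    policy_kwargs=dict(
        net_arch=dict(pi=[64, 64], vf=[64, 64]),     # two hidden layers of size 64
        optimizer_class=torch.optim.Adam,
        optimizer_kwargs=dict(betas=(0.99, 0.999)),  # quoted Adam betas
    ),
    verbose=1,
)

# 500 training epochs of 4000 steps each, as quoted.
model.learn(total_timesteps=4000 * 500)
```

To honor the separate 0.001 learning rate for the value network exactly, one would need a custom `ActorCriticPolicy` with per-parameter-group learning rates in the Adam optimizer, which is beyond this sketch.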