Certifiably Robust Policy Learning against Adversarial Multi-Agent Communication

Authors: Yanchao Sun, Ruijie Zheng, Parisa Hassanzadeh, Yongyuan Liang, Soheil Feizi, Sumitra Ganesh, Furong Huang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in multiple environments verify that our defense significantly improves the robustness of trained policies against various types of attacks. In this section, we verify the robustness of our AME in multiple different CMARL environments against various communication attack algorithms. Then, we conduct hyperparameter tests for the ablation size k and the sample size D (for the variant of AME introduced in Section 4.3).
Researcher Affiliation | Collaboration | University of Maryland, College Park ({ycs, rzheng12, sfeizi, furongh}@umd.edu); JPMorgan AI Research ({parisa.hassanzadeh, sumitra.ganesh}@jpmchase.com); Shanghai AI Lab (cheryllLiang@outlook.com)
Pseudocode | Yes | Algorithm 1: Training Phase of AME (a hedged sketch of the ablation-ensemble idea appears below the table)
Open Source Code | Yes | Our implementation of the AME algorithm and the Food Collector environment are available at https://github.com/umd-huang-lab/cmarl_ame.git.
Open Datasets | Yes | More specifically, we use N = 9 agents in the MNIST dataset of handwritten digits (LeCun et al., 1998). We use the same environment setup as the one in (Singh et al., 2019).
Dataset Splits | No | The paper states that the MNIST dataset consists of "60,000 training images and 10,000 testing images" but does not describe a separate validation split or how one would be created if used.
Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions software such as stable-baselines3 (Raffin et al., 2021) but does not pin version numbers for its dependencies (e.g., the PyTorch or stable-baselines3 version).
Experiment Setup | Yes | For the policy network, we use a multi-layer perceptron (MLP) with two hidden layers of size 64. ... we use a learning rate of 0.0003 for the policy network, and a learning rate of 0.001 is used for the value network. We use the Adam optimizer with β1 = 0.99 and β2 = 0.999. For every training epoch, the PPO agent interacts with the environment for 4000 steps, and it is trained for 500 epochs in our experiments. (A hedged configuration sketch appears below the table.)
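The paper's Algorithm 1 is not reproduced on this page, so the Python sketch below only illustrates the message-ablation idea behind AME as we read it from the quoted text: the base policy is trained on randomly sampled size-k subsets of the received messages, and at execution time its actions over D such subsets are aggregated (majority vote for discrete actions). Every name here (ablate, base_policy.act, and so on) is an illustrative placeholder rather than the authors' API; the linked repository contains the actual implementation.

```python
import random
from collections import Counter

def ablate(messages, k, rng=random):
    # Keep only a random subset of k received messages; the rest are dropped.
    kept = rng.sample(range(len(messages)), k)
    return [messages[i] for i in kept]

def train_step(base_policy, observation, messages, k):
    # Training phase (cf. Algorithm 1, as we read it): the base policy acts on
    # the observation paired with a randomly ablated size-k message subset, so
    # it learns to behave well given any k of the received messages.
    ablated = ablate(messages, k)
    action = base_policy.act(observation, ablated)
    # ... the transition would then be stored and the policy updated with the
    # usual RL objective (PPO in the paper's experiments).
    return action

def ame_act_discrete(base_policy, observation, messages, k, D, rng=random):
    # Execution phase for discrete actions: aggregate the base policy's actions
    # over D sampled size-k ablations by majority vote.
    votes = Counter()
    for _ in range(D):
        ablated = ablate(messages, k, rng)
        votes[base_policy.act(observation, ablated)] += 1
    return votes.most_common(1)[0][0]
```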
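The Experiment Setup row quotes the training hyperparameters verbatim. As a reading aid, the PyTorch sketch below simply translates them into code under stated assumptions: the observation/action dimensions and the commented-out collect_rollout / ppo_update helpers are placeholders, not the authors' implementation, which uses PPO (the paper references stable-baselines3) and is available in the linked repository.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the quoted setup does not tie them to an environment.
OBS_DIM, ACT_DIM = 32, 5

def mlp(in_dim, out_dim, hidden=64):
    # MLP with two hidden layers of size 64, as described in the setup.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

policy_net = mlp(OBS_DIM, ACT_DIM)  # actor
value_net = mlp(OBS_DIM, 1)         # critic

# Separate optimizers with the reported learning rates and Adam betas.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4, betas=(0.99, 0.999))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3, betas=(0.99, 0.999))

STEPS_PER_EPOCH = 4000  # environment interactions per training epoch
EPOCHS = 500            # total training epochs

for epoch in range(EPOCHS):
    # rollout = collect_rollout(env, policy_net, STEPS_PER_EPOCH)  # placeholder
    # ppo_update(policy_net, value_net, policy_opt, value_opt, rollout)  # placeholder
    pass
```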