Illusory Attacks: Information-theoretic detectability matters in adversarial attacks

Authors: Tim Franzmeyer, Stephen Marcus McAleer, Joao F. Henriques, Jakob Nicolaus Foerster, Philip Torr, Adel Bibi, Christian Schroeder de Witt

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared to existing attacks, we empirically find ϵ-illusory attacks to be significantly harder to detect with automated methods, and a small study with human participants suggests they are similarly harder to detect for humans.
Researcher Affiliation | Academia | University of Oxford; Carnegie Mellon University
Pseudocode | Yes | Algorithm 1: ϵ-illusory training (dual ascent); a hedged dual-ascent sketch is given after the table.
Open Source Code | Yes | To facilitate the reproducibility of our results, we release the code on our project page at https://tinyurl.com/illusory-attacks.
Open Datasets | Yes | We consider the simple stochastic MDP explained in Figure 2 and the four standard benchmark environments Cart Pole, Pendulum, Hopper, and Half Cheetah (see Figure 6 in the Appendix), which have continuous state spaces whose dimensionalities range from 1 to 17, as well as continuous and discrete action spaces. The mean and standard deviations of both detection and performance results are estimated from 200 independent episodes for each of 5 random seeds. Victim policies are pre-trained in unattacked environments and frozen during adversary training. (The seed aggregation is sketched after the table.)
Dataset Splits | No | The paper does not explicitly state train/validation/test dataset splits with percentages, absolute sample counts, or references to predefined splits for its main experiments. It mentions training a detector and tuning a decision rule, but not general dataset splits.
Hardware Specification | Yes | All reported times are measured using an NVIDIA GeForce GTX 1080 and an Intel Xeon Silver 4116 CPU.
Software Dependencies | No | The paper mentions using the implementations of PPO and SAC given in Raffin et al. (2021), but it does not specify version numbers for these software libraries or components (e.g., "Stable-Baselines3 2.0.0" or "PyTorch 1.9").
Experiment Setup | Yes | We shortened the episodes in Hopper and Half Cheetah to 300 steps to speed up training. The transition function is implemented using the physics engines given in all environments. We normalize observations by the maximum absolute observation. We train the victim with PPO (Schulman et al., 2017) and use the implementation of PPO given in Raffin et al. (2021), while not making any changes to the given hyperparameters. In both environments we train the victim for 1 million environment steps. We implement the illusory adversary agent with SAC (Haarnoja et al., 2018), where we likewise use the implementation given in Raffin et al. (2021). ... We further ran a small study over hyperparameters α ∈ {0.01, 0.1, 1} and the initial value for λ ∈ {10, 100} and chose the best performing combination. We train all adversarial attacks for four million environment steps. (A hedged Stable-Baselines3 sketch of this setup follows the table.)
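
The dual-ascent structure of Algorithm 1 can be summarized in a short sketch. This is not the authors' released code: adversary, collect_rollouts, update, and estimate_detectability are hypothetical placeholders, and the detectability estimate stands in for the paper's information-theoretic term.

```python
def illusory_training_sketch(adversary, estimate_detectability, epsilon,
                             alpha=0.1, lam=10.0, iterations=1_000):
    """Hypothetical dual-ascent loop for epsilon-illusory adversary training."""
    for _ in range(iterations):
        # Primal step: improve the adversary on the Lagrangian
        # (attack objective) - lam * (detectability penalty).
        batch = adversary.collect_rollouts()
        adversary.update(batch, penalty_weight=lam)  # e.g. one SAC update

        # Dual step: increase lam when the estimated detectability exceeds
        # the budget epsilon, decrease it otherwise, and keep lam >= 0.
        detect = estimate_detectability(batch)
        lam = max(0.0, lam + alpha * (detect - epsilon))
    return adversary
```

With the grid quoted under Experiment Setup, alpha would be swept over {0.01, 0.1, 1} and the initial lam over {10, 100}.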
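The quoted Experiment Setup maps onto the library of Raffin et al. (2021), Stable-Baselines3. The following is a minimal sketch of the victim pre-training only, assuming a Gymnasium environment ID and default hyperparameters; the attack environment wrapper (IllusoryAttackEnv) is a hypothetical placeholder, not part of the paper or the library.

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC

# Shortened 300-step episodes for Hopper / Half Cheetah, as in the quoted text.
env = gym.make("Hopper-v4", max_episode_steps=300)

# Victim: PPO with the library's default hyperparameters, trained for
# 1 million environment steps and then frozen (saved, never updated again).
victim = PPO("MlpPolicy", env, verbose=0)
victim.learn(total_timesteps=1_000_000)
victim.save("victim_hopper")

# Adversary: SAC trained for 4 million steps against the frozen victim.
# The paper's adversary acts inside an observation-perturbation environment,
# which is not a standard Gym environment; IllusoryAttackEnv is a placeholder.
# attack_env = IllusoryAttackEnv(env, victim)
# adversary = SAC("MlpPolicy", attack_env)
# adversary.learn(total_timesteps=4_000_000)
```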
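The reported statistics (200 independent episodes for each of 5 random seeds) amount to simple per-seed aggregation. The sketch below assumes a run_episode helper returning a per-episode metric (e.g. a detection flag or an episode return) and reflects one plausible reading of the aggregation, not the authors' evaluation code.

```python
import numpy as np

def evaluate(run_episode, seeds=range(5), episodes_per_seed=200):
    # Average the metric within each seed, then report the mean and standard
    # deviation of the per-seed averages across the 5 seeds.
    per_seed_means = [
        np.mean([run_episode(seed=seed) for _ in range(episodes_per_seed)])
        for seed in seeds
    ]
    return float(np.mean(per_seed_means)), float(np.std(per_seed_means))
```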