Illusory Attacks: Information-theoretic detectability matters in adversarial attacks
Authors: Tim Franzmeyer, Stephen Marcus McAleer, Joao F. Henriques, Jakob Nicolaus Foerster, Philip Torr, Adel Bibi, Christian Schroeder de Witt
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to existing attacks, we empirically find ϵ-illusory attacks to be significantly harder to detect with automated methods, and a small study with human participants suggests they are similarly harder to detect for humans. |
| Researcher Affiliation | Academia | University of Oxford; Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 ϵ-illusory training (dual ascent) |
| Open Source Code | Yes | To facilitate the reproducibility of our results, we release the code on our project page at https://tinyurl.com/illusory-attacks. |
| Open Datasets | Yes | We consider the simple stochastic MDP explained in Figure 2 and the four standard benchmark environments Cart Pole, Pendulum, Hopper and Half Cheetah (see Figure 6 in the Appendix), which have continuous state spaces whose dimensionalities range from 1 to 17, as well as continuous and discrete action spaces. The mean and standard deviations of both detection and performance results are estimated from 200 independent episodes per each of 5 random seeds. Victim policies are pre-trained in unattacked environments, and frozen during adversary training. |
| Dataset Splits | No | The paper does not explicitly state train/validation/test dataset splits with percentages, absolute sample counts, or references to predefined splits for its main experiments. It mentions training a detector and tuning a decision rule but not general dataset splits. |
| Hardware Specification | Yes | All reported times are measured using an NVIDIA GeForce GTX 1080 and an Intel Xeon Silver 4116 CPU. |
| Software Dependencies | No | The paper mentions using implementations of PPO and SAC given in Raffin et al. (2021), but it does not specify explicit version numbers for these software libraries or components (e.g., "Stable-Baselines3 2.0.0" or "PyTorch 1.9"). |
| Experiment Setup | Yes | We shortened the episodes in Hopper and Half Cheetah to 300 steps to speed up training. The transition function is implemented using the physics engines given in all environments. We normalize observations by the maximum absolute observation. We train the victim with PPO (Schulman et al., 2017) and use the implementation of PPO given in Raffin et al. (2021), while not making any changes to the given hyperparameters. In both environments we train the victim for 1 million environment steps. We implement the illusory adversary agent with SAC (Haarnoja et al., 2018), where we likewise use the implementation given in Raffin et al. (2021). ... We further ran a small study over hyperparameters α ∈ {0.01, 0.1, 1} and the initial value for λ ∈ {10, 100} and chose the best performing combination. We train all adversarial attacks for four million environment steps. |
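The dual-ascent scheme named in Algorithm 1 (ϵ-illusory training) can be sketched on a toy scalar problem: maximize an attack objective subject to a detectability budget ε, with the Lagrange multiplier λ updated by gradient ascent on the constraint violation. The objective, step sizes, and function names below are illustrative assumptions for a minimal sketch, not the paper's RL implementation (which trains a SAC adversary against a frozen victim).

```python
def reward(theta):
    # Toy adversary objective: attack payoff grows linearly with strength theta.
    # (Stand-in for the adversary's return; an assumption, not the paper's reward.)
    return theta

def detectability(theta):
    # Toy detectability measure: grows quadratically with attack strength.
    # (Stand-in for the information-theoretic detectability term.)
    return theta ** 2

def dual_ascent(epsilon=0.25, alpha=0.1, lam=10.0, lr=0.05, steps=2000):
    """Maximize reward(theta) subject to detectability(theta) <= epsilon
    via alternating primal/dual gradient steps (dual ascent)."""
    theta = 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on the Lagrangian
        #   L(theta, lam) = reward(theta) - lam * (detectability(theta) - epsilon)
        grad_theta = 1.0 - lam * 2.0 * theta
        theta += lr * grad_theta
        # Dual step: move lam up when the detectability budget is violated,
        # down when there is slack; project onto lam >= 0.
        lam = max(0.0, lam + alpha * (detectability(theta) - epsilon))
    return theta, lam
```

On this toy problem the multiplier settles where the detectability constraint is tight (θ² = ε); in the paper, the analogous λ update balances the adversary's reward against the ϵ detectability bound during training.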