Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Authors: Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our attacks achieve a high success rate for two different VLMs we evaluated, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models. |
| Researcher Affiliation | Academia | Erfan Shayegani, Yue Dong & Nael Abu-Ghazaleh Department of Computer Science University of California, Riverside {sshay004,yued,naelag}@ucr.edu |
| Pseudocode | Yes | Algorithm 1: Adversarial Image Generator via Embedding Space Matching (sketched in code after this table) |
| Open Source Code | No | The paper states: 'We plan to release our dataset with 4 types of malicious triggers and prompts.' (Footnote 1), but it does not explicitly commit to releasing the source code for the methodology or provide a link to it. |
| Open Datasets | Yes | Zou et al. (2023) and Bailey et al. (2023) utilize AdvBench, which consists of 521 lines of harmful behaviors and 575 lines of harmful strings. |
| Dataset Splits | No | The paper describes testing repetitions ('repeating each experiment 25 times') and the scenarios used, but does not provide specific train/validation/test dataset splits or reference predefined splits for reproducibility. |
| Hardware Specification | Yes | typically within 10 to 15 minutes when utilizing a Google Colab T4 GPU |
| Software Dependencies | No | The paper mentions using specific CLIP models and the ADAM optimizer, but does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, etc.). |
| Experiment Setup | Yes | In our experiments, we empirically found that a distance of around 0.3 or lower often indicates a powerful adversarial sample that will be as effective as the target malicious trigger when presented to the VLM (see the usage example after this table). |
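
The pseudocode row above refers to Algorithm 1, which generates an adversarial image by matching its embedding to that of a malicious trigger image in the vision encoder's embedding space. Below is a minimal sketch of that idea, not the authors' released code: it assumes a frozen open_clip ViT-L/14 image encoder, input images as tensors in [0, 1], and the Adam optimizer mentioned in the software-dependencies row; the model choice, hyperparameters, and the `make_adversarial` helper name are illustrative assumptions.

```python
# Sketch (not the authors' code): optimize an adversarial image whose CLIP
# embedding matches the embedding of a malicious target trigger image.
import torch
import torch.nn.functional as F
import open_clip
from torchvision.transforms import Normalize

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed vision encoder; the paper evaluates specific CLIP models.
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model = model.to(device).eval()
normalize = Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                      std=(0.26862954, 0.26130258, 0.27577711))  # CLIP stats

def embed(pixels):
    """L2-normalized CLIP image embedding for a batch of images in [0, 1]."""
    feats = model.encode_image(normalize(pixels))
    return F.normalize(feats, dim=-1)

def make_adversarial(benign, target, steps=1000, lr=0.01):
    """Optimize a perturbation so the adversarial image's embedding matches
    the target trigger's embedding (embedding-space matching)."""
    benign, target = benign.to(device), target.to(device)
    with torch.no_grad():
        target_emb = embed(target)
    delta = torch.zeros_like(benign, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)  # Adam, as noted in the paper
    for _ in range(steps):
        adv = (benign + delta).clamp(0.0, 1.0)
        # Embedding-space matching loss: 1 - cosine similarity to the target.
        dist = 1.0 - F.cosine_similarity(embed(adv), target_emb).mean()
        opt.zero_grad()
        dist.backward()
        opt.step()
    return (benign + delta).clamp(0.0, 1.0).detach(), float(dist)
```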
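
The experiment-setup row quotes an empirical stopping criterion of roughly 0.3 embedding distance. A hedged usage example built on the sketch above; the retry loop and the `benign_img`/`trigger_img` tensors are hypothetical, while the 0.3 threshold is the value quoted from the paper.

```python
# Hypothetical driver: keep refining until the embedding distance drops to
# ~0.3 or lower, the empirical threshold quoted in the experiment-setup row.
adv_img, dist = make_adversarial(benign_img, trigger_img, steps=1000)
while dist > 0.3:
    adv_img, dist = make_adversarial(adv_img, trigger_img, steps=500)
print(f"final embedding distance to the trigger: {dist:.3f}")
```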