VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Authors: Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to attack five widely-used VL pre-trained models for six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines.
Researcher Affiliation | Academia | 1 The Pennsylvania State University, 2 Zhejiang University, 3 Xi'an Jiaotong University, 4 Dalian University of Technology, 5 Stony Brook University
Pseudocode | Yes | Algorithm 1: VLATTACK
Open Source Code | Yes | Source code can be found at https://github.com/ericyinyzy/VLAttack.
Open Datasets | Yes | Experiments are conducted on five pre-trained models... two downstream tasks, including the visual question answering (VQA) task on the VQAv2 dataset [48] and the visual reasoning (VR) task on the NLVR2 dataset [50]. For UniTAB, evaluations are made on the VQAv2 dataset for the VQA task and on the RefCOCO, RefCOCO+, and RefCOCOg datasets [49] for the Referring Expression Comprehension (REC) task... For OFA, we implement experiments on the same tasks as UniTAB and add the SNLI-VE dataset [51] for the visual entailment (VE) task... We evaluate the uni-modal tasks on OFA [5] using MSCOCO [52] for the image captioning task and ImageNet-1K [53] for the image classification task. We also evaluate CLIP [29] on the image classification task on the SVHN [54] dataset. (See the dataset mapping summarized after this table.)
Dataset Splits | Yes | For each dataset, we sample 5K correctly predicted samples in the corresponding validation dataset to evaluate the ASR performance. All validation datasets follow the same split settings as adopted in the respective attack models. (See the sampling sketch after this table.)
Hardware Specification | Yes | All experiments are conducted on a single GTX A6000 GPU.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | For the perturbation parameters of images, we follow the setting in the common transferable image attacks [18, 19] and set the maximum perturbation σ_i of each pixel to 16/255 on all tasks except REC. Considering that even a single coordinate change can affect the final grounding results to a great extent, the σ_i of the REC task is 4/255 to better highlight the ASR differences among distinct methods. The total iteration number N and step size are set to 40 and 0.01 by following the projected gradient descent method [30], and N_s is 20. For the perturbation on the text, the semantic similarity constraint σ_s is set to 0.95, and the number of maximum modified words is set to 1 by following the previous text-attack work [15, 24] to ensure semantic consistency and imperceptibility. (See the PGD-style sketch after this table.)
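
The Open Datasets row maps models to tasks and datasets in running prose. The snippet below restates only the pairings that the quoted text names explicitly (UniTAB, OFA, CLIP) as a plain Python dictionary; the structure and key names are illustrative, not the released repository's configuration format.

```python
# Partial model -> task -> dataset mapping, restating only what the quoted
# dataset description makes explicit; not the repository's own config format.
EVAL_DATASETS = {
    "UniTAB": {
        "VQA": ["VQAv2"],
        "REC": ["RefCOCO", "RefCOCO+", "RefCOCOg"],
    },
    "OFA": {
        "VQA": ["VQAv2"],
        "REC": ["RefCOCO", "RefCOCO+", "RefCOCOg"],
        "VE": ["SNLI-VE"],
        "Image Captioning": ["MSCOCO"],
        "Image Classification": ["ImageNet-1K"],
    },
    "CLIP": {
        "Image Classification": ["SVHN"],
    },
}
```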
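
The Dataset Splits row describes drawing 5K correctly predicted validation samples per dataset before measuring the attack success rate (ASR). A minimal sketch of that selection step is shown below; `predict_fn`, `true_label_fn`, and the sampling seed are placeholder assumptions, not the authors' code.

```python
import random

def sample_correctly_predicted(val_set, predict_fn, true_label_fn, k=5000, seed=0):
    """Keep examples the victim model already predicts correctly, then draw
    k of them to serve as the attack-success-rate (ASR) evaluation pool."""
    correct = [ex for ex in val_set if predict_fn(ex) == true_label_fn(ex)]
    return random.Random(seed).sample(correct, min(k, len(correct)))
```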
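
The Experiment Setup row quotes an L_inf budget of 16/255 per pixel (4/255 for REC), N = 40 iterations, and a step size of 0.01, following projected gradient descent [30]. Below is a minimal PGD-style sketch wired to those numbers; `model` and `loss_fn` are placeholders, and this is only the generic projection loop, not the paper's block-wise similarity attack or the full VLATTACK pipeline.

```python
import torch

def pgd_perturb(model, loss_fn, image, target, eps=16 / 255, step_size=0.01, steps=40):
    """Iteratively ascend the loss and project back into the L_inf ball of
    radius eps around the clean image (eps = 4/255 for the REC task)."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), target)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + step_size * grad.sign()          # gradient-sign ascent step
        adv = image + torch.clamp(adv - image, -eps, eps)     # project into the eps-ball
        adv = adv.clamp(0.0, 1.0)                             # keep valid pixel range
    return adv.detach()
```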