Data Poisoning Attacks Against Multimodal Encoders

Authors: Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, Yang Zhang

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on different datasets and model architectures show that all three attacks can achieve significant attack performance while maintaining model utility in both visual and linguistic modalities. Furthermore, we observe that the poisoning effect differs between different modalities. To mitigate the attacks, we propose both pre-training and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model's utility.
Researcher Affiliation | Academia | (1) CISPA Helmholtz Center for Information Security, Saarbrücken, Saarland, Germany; (2) University of Lausanne, Lausanne, Switzerland; (3) University of Birmingham, Birmingham, England, UK.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the attack methodology in narrative text.
Open Source Code | Yes | Our code is available at https://github.com/zqypku/mm_poison/.
Open Datasets | Yes | We rely on two training datasets, i.e., Flickr-PASCAL and COCO. They are derived from three widely used text-image datasets, namely Flickr30k (Young et al., 2014) (abbreviated as Flickr), PASCAL (Rashtchian et al., 2010), and COCO (Chen et al., 2015). ... Here, we introduce Visual Genome (VG) (Krishna et al., 2017), a representative image caption dataset.
Dataset Splits | No | The paper specifies training and testing data splits (e.g., "half of PASCAL as the training data and the other half of PASCAL as the test data"), but does not explicitly mention a separate validation set or its specific use for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used for running its experiments. It mentions using pre-trained CLIP models but gives no details of the authors' experimental setup.
Software Dependencies | No | The paper mentions software components like "CLIP", "Vision Transformer ViT-B/32", "Transformer", and "Adam optimizer". However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | The initial learning rate is set to be 10^-5 with a weight decay rate of 0.2. For the cosine scheduler, we set a minimum learning rate of 10^-6 and a decay rate of 1.0. Then we fine-tune the pre-trained model for 10 epochs with a batch size of 128.
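
To make the reported hyperparameters concrete, below is a minimal PyTorch sketch of that fine-tuning configuration. It is an illustrative assumption, not the authors' released code: DummyEncoder and the random batches are placeholders for the pre-trained CLIP (ViT-B/32) model and the Flickr-PASCAL/COCO dataloaders, and CosineAnnealingLR stands in for the paper's cosine scheduler (the quoted "decay rate of 1.0" has no direct equivalent in this API).

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the pre-trained CLIP ViT-B/32 encoders used in the paper;
# a real run would load CLIP and optimize its contrastive loss instead.
class DummyEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

model = DummyEncoder()
epochs, batch_size = 10, 128  # reported: 10 epochs, batch size 128
# Reported optimizer settings: Adam, initial LR 1e-5, weight decay 0.2.
optimizer = Adam(model.parameters(), lr=1e-5, weight_decay=0.2)
# Cosine schedule decaying toward the reported minimum LR of 1e-6.
scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)

for epoch in range(epochs):
    for _ in range(10):  # placeholder for iterating a real dataloader
        features = torch.randn(batch_size, 512)  # dummy input batch
        # Dummy objective: align projected features with their inputs.
        sim = (model(features) * F.normalize(features, dim=-1)).sum(dim=-1)
        loss = (1 - sim).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the cosine schedule once per epoch
```

Stepping the scheduler per epoch with T_max=epochs anneals the learning rate from 10^-5 to 10^-6 over the 10 fine-tuning epochs, matching the quoted schedule under these assumptions.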