Data Poisoning Attacks Against Multimodal Encoders
Authors: Ziqing Yang, Xinlei He, Zheng Li, Michael Backes, Mathias Humbert, Pascal Berrang, Yang Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on different datasets and model architectures show that all three attacks can achieve significant attack performance while maintaining model utility in both visual and linguistic modalities. Furthermore, we observe that the poisoning effect differs between different modalities. To mitigate the attacks, we propose both pretraining and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model's utility. |
| Researcher Affiliation | Academia | CISPA Helmholtz Center for Information Security, Saarbrücken, Saarland, Germany; University of Lausanne, Lausanne, Switzerland; University of Birmingham, Birmingham, England, UK. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the attack methodology in narrative text. |
| Open Source Code | Yes | Our code is available at https://github.com/zqypku/mm_poison/. |
| Open Datasets | Yes | We rely on two training datasets, i.e., Flickr-PASCAL and COCO. They are derived from three widely used text-image datasets, namely Flickr30k (Young et al., 2014) (abbreviated as Flickr), PASCAL (Rashtchian et al., 2010), and COCO (Chen et al., 2015). ... Here, we introduce Visual Genome (VG) (Krishna et al., 2017), a representative image caption dataset. |
| Dataset Splits | No | The paper specifies training and testing data splits (e.g., "half of PASCAL as the training data and the other half of PASCAL as the test data"), but does not explicitly mention a separate "validation" set or its specific use for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using pre-trained CLIP models but no details on the authors' experimental setup. |
| Software Dependencies | No | The paper mentions software components like "CLIP", "Vision Transformer ViT-B/32", "Transformer", and "Adam optimizer". However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The initial learning rate is set to be 10⁻⁵ with a weight decay rate of 0.2. For the cosine scheduler, we set a minimum learning rate of 10⁻⁶ and a decay rate of 1.0. Then we fine-tune the pre-trained model for 10 epochs with a batch size of 128. |
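The quoted setup maps onto a standard CLIP fine-tuning loop. Below is a minimal sketch assuming PyTorch and the official openai/CLIP package; `train_loader` is a hypothetical DataLoader and the symmetric contrastive loss is the standard CLIP objective, not necessarily the authors' exact training code (see the repository linked above for the released implementation).

```python
# Minimal sketch of the quoted fine-tuning recipe, assuming PyTorch and the
# official openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
# `train_loader` is a hypothetical DataLoader yielding (preprocessed images,
# clip.tokenize'd captions); it is a placeholder, not the authors' code.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP

# Hyperparameters as quoted: lr 1e-5, weight decay 0.2, cosine decay to 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10, eta_min=1e-6
)

def clip_loss(logits_per_image, logits_per_text):
    # Symmetric cross-entropy over the image-text similarity matrix.
    labels = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    return (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2

for epoch in range(10):                     # 10 fine-tuning epochs
    for images, texts in train_loader:      # batch size 128 (placeholder loader)
        logits_i, logits_t = model(images.to(device), texts.to(device))
        loss = clip_loss(logits_i, logits_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Note that the paper's "decay rate of 1.0" has no direct analogue in `CosineAnnealingLR` and is omitted here; the released repository remains the authoritative reference for the exact schedule.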