Robust Contrastive Language-Image Pretraining against Data Poisoning and Backdoor Attacks

Authors: Wenhan Yang, Jingdong Gao, Baharan Mirzasoleiman

NeurIPS 2023

Reproducibility assessment: each entry below gives the variable, the result, and the LLM's supporting response.
Research Type: Experimental — "Our extensive experiments show that RoCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models."
Researcher Affiliation: Collaboration — Wenhan Yang, Jingdong Gao, Baharan Mirzasoleiman ({hangeryang18, mxuan, baharan}@cs.ucla.edu), Computer Science Department, UCLA; "This research was supported by the National Science Foundation CAREER Award 2146492 and Cisco Systems."
Pseudocode: Yes — "Algorithm 1: Robust CLIP pre-training (RoCLIP)". A hedged sketch of this training step appears after this list.
Open Source Code: Yes — "Code is available at https://github.com/BigML-CS-UCLA/RoCLIP"
Open Datasets: Yes — "We use Conceptual Captions 3M (CC3M) (Sharma et al., 2018) as our pre-training dataset. ... We assess our method on 10 downstream datasets introduced by (Kornblith et al., 2019), the details of which can be found in Table 1."
Dataset Splits: Yes — "For pre-training, we randomly sampled 1M image-caption pairs from CC3M as our training dataset. ... We choose a random target image x_t from the Conceptual Captions validation set, and then choose a random target class from the ImageNet test set to generate a set of |T_adv| adversarial captions." (This poisoning setup is sketched after this list.)
Hardware Specification: No — The paper describes the experimental procedure and training details but does not specify the CPU or GPU models, or any other hardware, used to run the experiments.
Software Dependencies: No — The paper mentions an "open-source implementation of CLIP" with "ResNet50 as the image encoder and Transformer as the text encoder" and refers to specific components such as the InfoNCE loss and the EDA augmentation policy, but it does not give version numbers for any software libraries, frameworks, or dependencies.
Experiment Setup: Yes — "Each experiment is run with a batch size of 512 for 24 epochs... We select 2% of the total dataset size as our pool size and K = 3 in our experiments. ... In particular, we use random image cropping, horizontal flipping, color jittering (Wu et al., 2018), grayscale conversion (Wu et al., 2018), and blurring (Chen et al., 2020) in our image augmentation policies. For the text augmentation, we use the EDA proposed by (Wei & Zou, 2019), which includes synonym replacement, random swap, and random deletion as its augmentation policies." (The image augmentation pipeline is sketched below.)
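
To make Algorithm 1 concrete, here is a minimal sketch of the RoCLIP loss computation in PyTorch, assuming pre-normalized embeddings. The matching direction (caption-to-pool nearest neighbor), the pool update rule, and the helper names `clip_loss` and `roclip_loss` are simplifying assumptions based on the paper's description, not its actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_loss(z_img, z_txt, temperature=0.07):
    """Standard symmetric InfoNCE (CLIP) loss over a batch of
    L2-normalized image/caption embeddings of shape (batch, dim)."""
    logits = z_img @ z_txt.T / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_i = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i + loss_t) / 2

def roclip_loss(z_img, z_txt, pool, epoch, K=3):
    """One RoCLIP loss computation (hedged sketch of Algorithm 1).

    pool: L2-normalized caption embeddings from a varying random pool
    (2% of the dataset size in the paper). Every K-th epoch (K = 3 in the
    paper), each caption is swapped for its nearest neighbor in the pool
    before the CLIP loss is computed, breaking poisoned image-caption
    associations; other epochs train with the ordinary CLIP loss.
    """
    if epoch % K == 0 and epoch > 0:
        nn_idx = (z_txt @ pool.T).argmax(dim=-1)  # nearest pool caption
        z_txt = pool[nn_idx]
    return clip_loss(z_img, z_txt)
```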
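The targeted data poisoning setup quoted under Dataset Splits (one target image x_t paired with |T_adv| adversarial captions naming a target ImageNet class) can be sketched as below; the caption templates are hypothetical illustrations, not the paper's actual adversarial captions.

```python
import random

def make_poison_pairs(target_image, target_class_name, n_adv_captions, seed=0):
    """Pair one target image with |T_adv| adversarial captions that
    mention a target ImageNet class, as in a targeted poisoning attack."""
    rng = random.Random(seed)
    templates = [  # hypothetical prompt templates for illustration
        "a photo of a {}",
        "a close-up photo of a {}",
        "a cropped photo of a {}",
    ]
    captions = [rng.choice(templates).format(target_class_name)
                for _ in range(n_adv_captions)]
    # These pairs are injected into the pre-training data; a successful
    # attack makes CLIP map the target image to the target class.
    return [(target_image, caption) for caption in captions]
```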
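The image augmentation policy quoted under Experiment Setup maps naturally onto torchvision transforms. The parameter values below are common SimCLR-style defaults, not values reported in the paper; the EDA text augmentations (synonym replacement, random swap, random deletion) would be applied to captions analogously.

```python
from torchvision import transforms

# Random cropping, horizontal flipping, color jittering, grayscale
# conversion, and blurring, per the quoted augmentation policy.
image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```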