A Sober Look at the Robustness of CLIPs to Spurious Features

Authors: Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-train data, yet have limited influence for ImageNet models. Our experiments center on the evaluation and the analysis of our CounterAnimal dataset.
Researcher Affiliation | Academia | Qizhou Wang¹, Yong Lin², Yongqiang Chen³, Ludwig Schmidt⁴, Bo Han¹, Tong Zhang⁵ (¹TMLR Group, Department of Computer Science, Hong Kong Baptist University; ²The Hong Kong University of Science and Technology; ³The Chinese University of Hong Kong; ⁴University of Washington; ⁵University of Illinois Urbana-Champaign)
Pseudocode | No | The paper describes methods in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release our dataset and the evaluation codes in the future.
Open Datasets | Yes | The CounterAnimal dataset is curated based on raw photos collected from iNaturalist. We establish an anonymous repository for the access to our dataset, which can be found at the link of https://figshare.com/s/f9b0f34312168f4a8ddb.
Dataset Splits | No | The paper describes its new dataset, CounterAnimal, which is split into 'easy' and 'hard' groups for evaluating the zero-shot performance of pre-trained models. It does not provide traditional training, validation, or testing splits in the context of training new models for its experiments.
Hardware Specification | Yes | All experiments are realized by PyTorch 1.8.1 with CUDA 11.1, using machines equipped with GeForce RTX 3090 GPUs and AMD Threadripper 3960X processors.
Software Dependencies | Yes | All experiments are realized by PyTorch 1.8.1 with CUDA 11.1.
Experiment Setup | Yes | We evaluate a series of CLIP models on the CounterAnimal dataset for their zero-shot performance. For each class, we use the pre-defined prompt A photo of <object label>., as in our data collection procedure, and classify by the similarity between image and text embeddings. By default, we use the label space of the ImageNet-1K dataset and report the top-1 accuracy, i.e., the 1 vs. 1000 setup. Moreover, when involving more advanced LVLMs, we adopt the 1 vs. 20 setup, where we employ the top-20 most confusing classes regarding CLIP-LAION400M-ViT-B/32 as the candidate label space.
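
The zero-shot protocol described in the table maps naturally onto the open_clip API. Below is a minimal sketch of the 1 vs. 1000 evaluation on the easy and hard groups; the pretrained tag ("laion400m_e32"), the local directory layout (counteranimal/easy, counteranimal/hard), and the class-name file are assumptions for illustration, not names taken from the paper's released code.

```python
import torch
import open_clip
from torchvision import datasets

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-B/32 backbone pre-trained on LAION-400M; the exact open_clip tag is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# Text embeddings for all 1000 ImageNet-1K classes with the fixed prompt template.
# The class-name file (one name per line, in ImageNet label order) is assumed.
with open("imagenet_classnames.txt") as f:
    classnames = [line.strip() for line in f]
prompts = [f"A photo of {name}." for name in classnames]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def zero_shot_top1(image_dir):
    """Top-1 accuracy of prompt-based zero-shot classification over one split."""
    # Assumes the ImageFolder class order matches ImageNet label indices;
    # in practice an explicit folder-to-label mapping would be needed.
    dataset = datasets.ImageFolder(image_dir, transform=preprocess)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            image_features = model.encode_image(images.to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Assumed local paths to the two CounterAnimal groups.
acc_easy = zero_shot_top1("counteranimal/easy")
acc_hard = zero_shot_top1("counteranimal/hard")
print(f"easy: {acc_easy:.3f}  hard: {acc_hard:.3f}  drop: {acc_easy - acc_hard:.3f}")
```

The gap between the easy and hard accuracies is the quantity the evaluation uses to gauge how strongly a pre-trained model relies on the spurious background features isolated by the dataset.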