CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Authors: Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin G. Jamieson, Simon S. Du
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our methods on the data selection benchmark, DataComp [1]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. |
| Researcher Affiliation | Academia | Yiping Wang, University of Washington; Yifang Chen, University of Washington; Wendan Yan, University of Washington; Alex Fang, University of Washington; Wenjing Zhou, University of Michigan; Kevin Jamieson, University of Washington; Simon Shaolei Du, University of Washington |
| Pseudocode | Yes | Algorithm 1: negCLIPLoss |
| Open Source Code | Yes | Codes are available at https://github.com/ypwang61/negCLIPLoss_NormSim. |
| Open Datasets | Yes | We test our methods on the data selection benchmark, DataComp [1]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. |
| Dataset Splits | No | The paper states, 'We adhere to the standardized training and evaluation protocols of the DataComp benchmark [1].' and discusses training on subsets of DataComp-medium and evaluating on 38 downstream datasets. While it implies standard splits, it does not explicitly provide validation dataset splits in terms of percentages or sample counts. |
| Hardware Specification | Yes | For MLM, they mention needing 6.1 minutes to process 10k samples on an A100, which results in 1120 A100 hours for our dataset (110M). |
| Software Dependencies | No | The paper mentions 'pytorch-style parallel matrix calculation' and the 'faiss library' for k-means clustering, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We employ the medium-scale training configuration of DataComp (DataComp-medium). It provides a substantial dataset comprising 128 million low-quality, web-curated image-text pairs to be filtered. Once a data subset is obtained by some data filtering strategy, it is used to train a fixed CLIP-B/32 model under a fixed training budget equivalent to one pass over 128 million data points. |
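The core scoring rule summarized above (Algorithm 1, negCLIPLoss) can be illustrated with a minimal NumPy sketch. This is not the authors' exact implementation: it assumes L2-normalized image/text embeddings from a pretrained CLIP teacher and an illustrative temperature, and it scores each pair by its matched similarity minus a batch-level normalization term (the log-partition of the contrastive loss in both directions), so that pairs scoring high only because the whole batch is "easy" are down-weighted relative to raw CLIPScore.

```python
import numpy as np


def neg_cliploss_scores(img_embs: np.ndarray, txt_embs: np.ndarray,
                        tau: float = 0.01) -> np.ndarray:
    """Sketch of a negCLIPLoss-style score for a batch of image-text pairs.

    img_embs, txt_embs: (n, d) arrays of L2-normalized embeddings, where
    row i of each array forms a matched pair. tau is an illustrative
    temperature, not the paper's value.
    """
    # Pairwise similarity logits between every image and every text.
    sims = img_embs @ txt_embs.T / tau
    # Raw CLIPScore-style term: similarity of each matched pair.
    clip_score = np.diag(sims)
    # Batch normalization terms: log-partition over image->text rows
    # and text->image columns, as in the symmetric contrastive loss.
    norm_i2t = np.log(np.exp(sims).sum(axis=1))
    norm_t2i = np.log(np.exp(sims).sum(axis=0))
    # Higher score = pair stands out relative to the rest of the batch.
    return clip_score - 0.5 * (norm_i2t + norm_t2i)
```

In a data-selection pipeline, this score would be computed over random batches with a frozen teacher model and the top-ranked fraction of pairs kept for training; the formula here is only a schematic of that idea.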