CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Authors: Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin G. Jamieson, Simon S. Du

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our methods on the data selection benchmark, DataComp [1]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks.
Researcher Affiliation | Academia | Yiping Wang (University of Washington); Yifang Chen (University of Washington); Wendan Yan (University of Washington); Alex Fang (University of Washington); Wenjing Zhou (University of Michigan); Kevin Jamieson (University of Washington); Simon Shaolei Du (University of Washington)
Pseudocode | Yes | Algorithm 1: negCLIPLoss. (A hedged re-implementation sketch of this scoring rule appears after the table.)
Open Source Code | Yes | Codes are available at https://github.com/ypwang61/negCLIPLoss_NormSim.
Open Datasets | Yes | We test our methods on the data selection benchmark, DataComp [1]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks.
Dataset Splits | No | The paper states, 'We adhere to the standardized training and evaluation protocols of the DataComp benchmark [1],' and discusses training on subsets of DataComp-medium and evaluating on 38 downstream datasets. While it implies standard splits, it does not explicitly provide validation dataset splits as percentages or sample counts.
Hardware Specification | Yes | For MLM, the paper notes that processing 10k samples takes 6.1 minutes on an A100, which works out to roughly 1,120 A100-hours for the full 110M-sample dataset.
Software Dependencies | No | The paper mentions 'pytorch-style parallel matrix calculation' and the 'faiss library' for k-means clustering, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We employ the medium-scale training configuration of DataComp (DataComp-medium). It provides a substantial dataset comprising 128 million low-quality, web-curated image-text pairs to be filtered. Once a data subset is obtained by some data filtering strategy, it is used to train a fixed CLIP-B/32 model under a fixed training budget that allows the model to pass 128 million data points in one epoch. (A hedged sketch of this top-fraction filtering step also follows the table.)
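
The Pseudocode row above references Algorithm 1 (negCLIPLoss). Below is a minimal PyTorch sketch of that scoring idea, assuming L2-normalized embeddings from a pretrained teacher CLIP model; the function name, temperature value, and batching constants are illustrative assumptions, not the authors' implementation (see the linked repository for their actual code).

    # Hedged sketch: score each image-text pair by its negated CLIP
    # contrastive loss, averaged over several random batches to stabilize
    # the batch-dependent normalization term. Names and defaults are
    # assumptions, not the paper's code.
    import torch
    import torch.nn.functional as F

    def neg_cliploss(image_feats, text_feats, temperature=0.01,
                     num_batches=5, batch_size=512):
        """image_feats, text_feats: (N, d) L2-normalized teacher embeddings.
        Returns an (N,) tensor; higher values indicate higher estimated
        pair quality. temperature=0.01 assumes the common CLIP logit scale.
        """
        n = image_feats.shape[0]
        scores = torch.zeros(n)
        counts = torch.zeros(n)
        for _ in range(num_batches):
            perm = torch.randperm(n)  # fresh random batching each round
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                # Pairwise similarities within the batch, scaled by temperature.
                logits = image_feats[idx] @ text_feats[idx].T / temperature
                labels = torch.arange(len(idx))
                # Per-sample CLIP loss: mean of image->text and text->image terms.
                loss_i2t = F.cross_entropy(logits, labels, reduction="none")
                loss_t2i = F.cross_entropy(logits.T, labels, reduction="none")
                scores[idx] += -(loss_i2t + loss_t2i) / 2
                counts[idx] += 1
        return scores / counts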
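
The Experiment Setup row describes the DataComp-style pipeline: score the 128M-pair pool, keep a top-scoring subset, and train the fixed CLIP-B/32 recipe on it. A minimal sketch of the selection step follows; the keep fraction is an illustrative placeholder, not the value tuned in the paper.

    # Hedged sketch of the top-fraction selection step; keep_fraction is
    # an illustrative placeholder, not the paper's tuned value.
    import numpy as np

    def select_subset(scores: np.ndarray, keep_fraction: float = 0.3) -> np.ndarray:
        """Return indices of the highest-scoring image-text pairs
        (e.g., scored by negCLIPLoss above)."""
        k = int(len(scores) * keep_fraction)
        return np.argsort(scores)[-k:]  # indices of the top-k scores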