Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We study performance on over 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on."
Researcher Affiliation | Industry | All authors are affiliated with OpenAI, San Francisco, CA 94110, USA. Correspondence to: {alec, jongwook}@openai.com.
Pseudocode | Yes | "In Figure 3 we include pseudocode for the core of an implementation of CLIP." (A runnable sketch of this core follows the table.)
Open Source Code | Yes | "We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP." (A zero-shot usage sketch follows the table.)
Open Datasets | No | "To test this we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet... We refer to this dataset as WIT for WebImageText." No concrete access information (link, DOI, etc.) is provided for the constructed WIT dataset itself.
Dataset Splits | No | The paper states, "Despite our emphasis on zero-shot transfer, we repeatedly queried performance on validation sets to guide development." However, it does not give concrete split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, and test sets across its experiments, particularly for the WIT dataset.
Hardware Specification | Yes | "The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs."
Software Dependencies | Yes | "Finally, we'd also like to thank the developers of the many software packages used throughout this project including, but not limited to, NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011)."
Experiment Setup | Yes | "We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101... For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch... Full model hyperparameters and details are in supplementary material."
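
The Pseudocode row refers to Figure 3 of the paper, which sketches the core of CLIP in NumPy-like pseudocode: encode both modalities, project into a joint embedding space, and train with a symmetric cross-entropy over the scaled cosine-similarity matrix. Below is a minimal runnable PyTorch rendering of that objective; the batch size, embedding width, and random stand-in encoder outputs are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor,
              text_features: torch.Tensor,
              log_logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over n aligned (image, text) pairs,
    mirroring the structure of the paper's Figure 3 pseudocode."""
    # Joint multimodal embedding: L2-normalize each modality's projection.
    image_embed = F.normalize(image_features, dim=-1)
    text_embed = F.normalize(text_features, dim=-1)

    # Scaled pairwise cosine similarities, shape [n, n].
    logits = image_embed @ text_embed.t() * log_logit_scale.exp()

    # The positive pair for row i sits on the diagonal at column i.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2

# Illustrative stand-ins for encoder outputs: n = 8 pairs, 512-d projections.
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
# The paper learns the temperature, initialized to 0.07; the scale is stored in log space.
log_logit_scale = torch.log(torch.tensor(1 / 0.07))
print(clip_loss(image_features, text_features, log_logit_scale))
```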
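Since the code and pre-trained weights are released, the zero-shot transfer described in the Research Type row can be tried directly with the published package (installed from the repository above, plus torch and Pillow). The sketch below follows the repository's documented API; the image path and candidate captions are placeholders to swap for your own.

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" and the candidate captions are hypothetical placeholders.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # The model returns image-text similarity logits, already temperature-scaled.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```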