Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We study performance on over 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on."
Researcher Affiliation | Industry | All authors are affiliated with OpenAI, San Francisco, CA 94110, USA. Correspondence to: {alec, jongwook}@openai.com.
Pseudocode | Yes | "In Figure 3 we include pseudocode for the core of an implementation of CLIP." (A runnable sketch of this core follows the table.)
Open Source Code | Yes | "We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP." (A zero-shot usage sketch follows the table.)
Open Datasets | No | "To test this we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet... We refer to this dataset as WIT for WebImageText." No concrete access information (link, DOI, etc.) is provided for the constructed WIT dataset itself.
Dataset Splits | No | The paper states, "Despite our emphasis on zero-shot transfer, we repeatedly queried performance on validation sets to guide development." However, it does not give concrete split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, and test sets across its experiments, particularly for the WIT dataset.
Hardware Specification | Yes | "The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs."
Software Dependencies | Yes | "Finally, we'd also like to thank the developers of the many software packages used throughout this project including, but not limited to, NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011)."
Experiment Setup | Yes | "We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101... For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch... Full model hyperparameters and details are in supplementary material."
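
The Pseudocode row refers to Figure 3 of the paper, which sketches the core of CLIP in NumPy-like pseudocode: encode both modalities, project into a joint embedding space, and train with a symmetric cross-entropy over the scaled cosine-similarity matrix. Below is a minimal runnable PyTorch rendering of that objective; the batch size, embedding width, and random stand-in encoder outputs are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor,
              text_features: torch.Tensor,
              log_logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over n aligned (image, text) pairs,
    mirroring the structure of the paper's Figure 3 pseudocode."""
    # Joint multimodal embedding: L2-normalize each modality's projection.
    image_embed = F.normalize(image_features, dim=-1)
    text_embed = F.normalize(text_features, dim=-1)

    # Scaled pairwise cosine similarities, shape [n, n].
    logits = image_embed @ text_embed.t() * log_logit_scale.exp()

    # The positive pair for row i sits on the diagonal at column i.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2

# Illustrative stand-ins for encoder outputs: n = 8 pairs, 512-d projections.
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
# The paper learns the temperature, initialized to 0.07; the scale is stored in log space.
log_logit_scale = torch.log(torch.tensor(1 / 0.07))
print(clip_loss(image_features, text_features, log_logit_scale))
```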
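Since the code and pre-trained weights are released, the zero-shot transfer described in the Research Type row can be tried directly with the published package (installed from the repository above, plus torch and Pillow). The sketch below follows the repository's documented API; the image path and candidate captions are placeholders to swap for your own.

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" and the candidate captions are hypothetical placeholders.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # The model returns image-text similarity logits, already temperature-scaled.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```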