Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Authors: Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks. |
| Researcher Affiliation | Industry | 1Google Research. Correspondence to: Chao Jia <chaojia@google.com>, Yinfei Yang <yinfeiy@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using an 'open-sourced implementation of EfficientNet' and 'BERT', but there is no explicit statement about releasing the source code for the ALIGN model or method described in this paper, nor is a link provided. |
| Open Datasets | Yes | In the existing literature, visual and vision-language representation learning are mostly studied separately with different training data sources. In the vision domain, pre-training on large-scale supervised data such as ImageNet (Deng et al., 2009), Open Images (Kuznetsova et al., 2020), and JFT-300M (Sun et al., 2017; Kolesnikov et al., 2020) has proven to be critical for improving performance on downstream tasks via transfer learning. ...vision-language pre-training datasets such as Conceptual Captions (Sharma et al., 2018), Visual Genome Dense Captions (Krishna et al., 2016), and ImageBERT (Qi et al., 2020). |
| Dataset Splits | Yes | For MSCOCO, we evaluate on the 5K test set, and finetune on 82K training plus 30K additional validation images that are not in the 5K validation or 5K test sets. ...Each task is trained on 800 images and the hyperparameters are selected using the validation set of 200 images. After the sweep, the selected hyperparameters are used to train on the combined training and validation splits of 1000 images for each task. |
| Hardware Specification | Yes | We train the model on 1024 Cloud TPUv3 cores with 16 positive pairs on each core. |
| Software Dependencies | No | The paper mentions using 'EfficientNet', 'BERT', and the 'LAMB optimizer', but it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | The image encoder is trained at a resolution of 289 × 289 pixels no matter what EfficientNet variant is used. We first resize input images to 346 × 346 resolution and then perform random crop (with additional random horizontal flip) in training and central crop in evaluation. For BERT we use wordpiece sequence of maximum 64 tokens... The softmax temperature variable is initialized as 1.0... and we use 0.1 as label smoothing parameter in the softmax losses. We use LAMB optimizer (You et al., 2020) with weight decay ratio 1e-5. The learning rate is warmed up linearly to 1e-3 from zero in 10k steps, and then linearly decay to zero in 1.2M steps (~12 epochs). We train the model on 1024 Cloud TPUv3 cores with 16 positive pairs on each core. Therefore the total effective batch size is 16384. (See the sketches following this table.) |
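
The "Experiment Setup" quote pins down the contrastive objective closely enough to sketch it. Below is a minimal NumPy sketch of a temperature-scaled, label-smoothed two-way softmax loss over in-batch image-text pairs, consistent with the quoted temperature initialization (1.0) and label smoothing (0.1). The function name, and treating the temperature as fixed rather than learnable, are assumptions of this sketch; the paper released no reference implementation.

```python
import numpy as np

def align_style_contrastive_loss(image_emb, text_emb,
                                 temperature=1.0, label_smoothing=0.1):
    """Two-way (image-to-text and text-to-image) softmax loss over in-batch pairs."""
    # L2-normalize so the dot products below are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix scaled by the temperature (learnable in the paper,
    # initialized at 1.0; held fixed in this sketch).
    logits = image_emb @ text_emb.T / temperature            # [batch, batch]
    n = logits.shape[0]

    # Label-smoothed targets: matched pairs sit on the diagonal.
    targets = np.full((n, n), label_smoothing / n)
    targets[np.diag_indices(n)] += 1.0 - label_smoothing

    def softmax_xent(lg, tg):
        lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -(tg * log_probs).sum(axis=1).mean()

    # Symmetric objective: image-to-text plus text-to-image cross-entropy.
    return softmax_xent(logits, targets) + softmax_xent(logits.T, targets)
```

For instance, `align_style_contrastive_loss(np.random.randn(16, 640), np.random.randn(16, 640))` evaluates the loss for a batch of 16 pairs, mirroring the 16 positive pairs per core quoted in the Hardware Specification row.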
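The quoted optimization schedule (linear warm-up to 1e-3 over 10k steps, then linear decay to zero around step 1.2M, at an effective batch size of 1024 × 16 = 16384) can likewise be written down directly. The helper below is a sketch under the assumption that "1.2M steps" refers to the total step count including warm-up; it is not the authors' code.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=10_000, total_steps=1_200_000):
    """Linear warm-up to peak_lr, then linear decay to zero (schedule sketch)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warm-up to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```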