Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Authors: Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks. |
| Researcher Affiliation | Industry | 1Google Research. Correspondence to: Chao Jia <chaojia@google.com>, Yinfei Yang <yinfeiy@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using an 'open-sourced implementation of EfficientNet' and 'BERT', but there is no explicit statement about releasing the source code for the ALIGN model or method described in this paper, nor is a link provided. |
| Open Datasets | Yes | In the existing literature, visual and vision-language representation learning are mostly studied separately with different training data sources. In the vision domain, pre-training on large-scale supervised data such as ImageNet (Deng et al., 2009), Open Images (Kuznetsova et al., 2020), and JFT-300M (Sun et al., 2017; Kolesnikov et al., 2020) has proven to be critical for improving performance on downstream tasks via transfer learning. ...vision-language pre-training datasets such as Conceptual Captions (Sharma et al., 2018), Visual Genome Dense Captions (Krishna et al., 2016), and ImageBERT (Qi et al., 2020). |
| Dataset Splits | Yes | For MSCOCO, we evaluate on the 5K test set, and finetune on 82K training plus 30K additional validation images that are not in the 5K validation or 5K test sets. ...Each task is trained on 800 images and the hyperparameters are selected using the validation set of 200 images. After the sweep, the selected hyperparameters are used to train on the combined training and validation splits of 1000 images for each task. |
| Hardware Specification | Yes | We train the model on 1024 Cloud TPUv3 cores with 16 positive pairs on each core. |
| Software Dependencies | No | The paper mentions using 'EfficientNet', 'BERT', and the 'LAMB optimizer', but it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | The image encoder is trained at a resolution of 289 × 289 pixels no matter what EfficientNet variant is used. We first resize input images to 346 × 346 resolution and then perform random crop (with additional random horizontal flip) in training and central crop in evaluation. For BERT we use wordpiece sequence of maximum 64 tokens... The softmax temperature variable is initialized as 1.0... and we use 0.1 as label smoothing parameter in the softmax losses. We use LAMB optimizer (You et al., 2020) with weight decay ratio 1e-5. The learning rate is warmed up linearly to 1e-3 from zero in 10k steps, and then linearly decay to zero in 1.2M steps (~12 epochs). We train the model on 1024 Cloud TPUv3 cores with 16 positive pairs on each core. Therefore the total effective batch size is 16384. (See the sketches following this table.) |
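
The "Experiment Setup" quote pins down the contrastive objective closely enough to sketch it. Below is a minimal NumPy sketch of a temperature-scaled, label-smoothed two-way softmax loss over in-batch image-text pairs, consistent with the quoted temperature initialization (1.0) and label smoothing (0.1). The function name, and treating the temperature as fixed rather than learnable, are assumptions of this sketch; the paper released no reference implementation.

```python
import numpy as np

def align_style_contrastive_loss(image_emb, text_emb,
                                 temperature=1.0, label_smoothing=0.1):
    """Two-way (image-to-text and text-to-image) softmax loss over in-batch pairs."""
    # L2-normalize so the dot products below are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix scaled by the temperature (learnable in the paper,
    # initialized at 1.0; held fixed in this sketch).
    logits = image_emb @ text_emb.T / temperature            # [batch, batch]
    n = logits.shape[0]

    # Label-smoothed targets: matched pairs sit on the diagonal.
    targets = np.full((n, n), label_smoothing / n)
    targets[np.diag_indices(n)] += 1.0 - label_smoothing

    def softmax_xent(lg, tg):
        lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -(tg * log_probs).sum(axis=1).mean()

    # Symmetric objective: image-to-text plus text-to-image cross-entropy.
    return softmax_xent(logits, targets) + softmax_xent(logits.T, targets)
```

For instance, `align_style_contrastive_loss(np.random.randn(16, 640), np.random.randn(16, 640))` evaluates the loss for a batch of 16 pairs, mirroring the 16 positive pairs per core quoted in the Hardware Specification row.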
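The quoted optimization schedule (linear warm-up to 1e-3 over 10k steps, then linear decay to zero around step 1.2M, at an effective batch size of 1024 × 16 = 16384) can likewise be written down directly. The helper below is a sketch under the assumption that "1.2M steps" refers to the total step count including warm-up; it is not the authors' code.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=10_000, total_steps=1_200_000):
    """Linear warm-up to peak_lr, then linear decay to zero (schedule sketch)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warm-up to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```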