How Much Can CLIP Benefit Vision-and-Language Tasks?

Authors: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
Researcher Affiliation | Academia | University of California, Berkeley; University of California, Los Angeles; University of North Carolina at Chapel Hill
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide the code to reproduce the main results in this paper in the supplementary material, which contains comprehensive instructions to reproduce our results. The code and model checkpoints will be made public.
Open Datasets | Yes | We evaluate on VQA v2.0 (Goyal et al., 2017)... We evaluate our model on COCO dataset (Chen et al., 2015)... We apply our model to two vision-and-language navigation datasets: Room-to-Room (R2R, Anderson et al. (2018b)) and Room-across-Room (RxR, Ku et al. (2020))... We use the same corpora aggregated from MS COCO Captions (Chen et al., 2015), Visual Genome Captions (Krishna et al., 2017), VQA (Antol et al., 2015), GQA (Hudson and Manning, 2019), and VG-QA (Zhu et al., 2016) for pre-training.
Dataset Splits | Yes | R2R is built on the indoor environments from the Matterport3D dataset (Chang et al., 2017). The environments are split into training, unseen validation, and unseen test.
Hardware Specification | Yes | The models are trained on one RTX 2080 Ti GPU... The model is trained on 8 Nvidia A100 GPUs and the pre-training takes around 5 days.
Software Dependencies | No | The paper mentions software like AdamW (Loshchilov and Hutter, 2017), Detectron2, and Stanza tokenizers (Qi et al., 2020a) but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We pre-train with a batch size of 512. The Transformer is initialized from BERTBASE and optimized with an AdamW (Loshchilov and Hutter, 2017) optimizer. We use a linearly-decaying schedule and a peak learning rate of 1e-4 for the model with CLIP-Res50 and 5e-5 for the model with CLIP-Res50x4. The ResNet is initialized from CLIP and we use SGD with a learning rate of 3e-3. We decay the learning rate of SGD at epochs 12, 17 by a factor of 10... We fine-tune the model with the binary cross-entropy loss for 5 epochs with a batch size of 256. The Transformer is optimized with AdamW and a peak learning rate of 5e-5. The ResNet is optimized with SGD and an initial learning rate of 1e-3. We decay the learning rate of ResNet by a factor of 10 after epoch 3.
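
To make the approach quoted in the Research Type row more concrete, the sketch below shows one possible way to reuse CLIP's ResNet-50 backbone as a grid-feature extractor with the OpenAI `clip` package. The forward hook on `model.visual.layer4` and the flattening into region-like tokens are illustrative assumptions, not the authors' released integration (which may, for example, modify the attention-pooling layer instead).

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

grid_features = {}

def save_grid(module, inputs, output):
    # Output of the last ResNet stage: a (batch, 2048, H/32, W/32) spatial feature grid.
    grid_features["visual"] = output

model.visual.layer4.register_forward_hook(save_grid)

# Dummy image batch; in practice `preprocess` would be applied to a PIL image.
dtype = model.visual.conv1.weight.dtype
image = torch.randn(1, 3, 224, 224, device=device, dtype=dtype)

with torch.no_grad():
    _ = model.encode_image(image)

# Flatten the grid into region-like tokens that a V&L model can consume in place
# of detector-based (BottomUp-TopDown) region features.
tokens = grid_features["visual"].flatten(2).transpose(1, 2)  # (batch, 49, 2048)
print(tokens.shape)
```

Feeding `tokens` to a task-specific head or a V&L Transformer corresponds to scenario 1) above, plugging CLIP into task-specific fine-tuning; scenario 2) additionally runs V&L pre-training on top of the same features.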
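The pre-training recipe quoted in the Experiment Setup row combines two optimizers and two schedules: per-step linear decay for the AdamW-trained Transformer and per-epoch milestone decay for the SGD-trained ResNet. The PyTorch sketch below mirrors that recipe under stated assumptions (toy parameter tensors, a placeholder step count, and SGD momentum of 0.9, none of which appear in the quote); it is a minimal illustration, not the authors' training code.

```python
import torch
from torch.optim import AdamW, SGD
from torch.optim.lr_scheduler import LambdaLR, MultiStepLR

# Stand-ins for the two parameter groups: the CLIP-initialized ResNet backbone
# and the BERT-base-initialized Transformer.
resnet_params = [torch.nn.Parameter(torch.randn(4, 4))]
transformer_params = [torch.nn.Parameter(torch.randn(4, 4))]

num_epochs = 20        # assumption: total epoch count is not stated in the quote
steps_per_epoch = 10   # toy value; in practice set by corpus size / batch size 512
total_steps = num_epochs * steps_per_epoch

# Transformer: AdamW with a linearly decaying schedule, peak LR 1e-4 (CLIP-Res50 model).
opt_txn = AdamW(transformer_params, lr=1e-4)
sched_txn = LambdaLR(opt_txn, lambda step: max(0.0, 1.0 - step / total_steps))

# ResNet: SGD at 3e-3 (momentum 0.9 is an assumption), decayed 10x at epochs 12 and 17.
opt_cnn = SGD(resnet_params, lr=3e-3, momentum=0.9)
sched_cnn = MultiStepLR(opt_cnn, milestones=[12, 17], gamma=0.1)

for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        # Toy loss standing in for the pre-training objective on a batch of 512.
        loss = sum((p ** 2).sum() for p in resnet_params + transformer_params)
        loss.backward()
        opt_txn.step(); opt_cnn.step()
        opt_txn.zero_grad(); opt_cnn.zero_grad()
        sched_txn.step()   # linear decay is stepped per optimization step
    sched_cnn.step()       # milestone decay is stepped per epoch
```

For VQA fine-tuning, the quote implies the same structure with a peak AdamW learning rate of 5e-5, an SGD learning rate of 1e-3, a batch size of 256, 5 epochs, and a single 10x decay of the ResNet learning rate after epoch 3.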