BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning

Authors: Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on different design choices for BRIDGETOWER and fine-tune it on various downstream VL tasks. Experimental results show that with only 4M images for pre-training, our model achieves state-of-the-art performance on various downstream VL tasks, especially 78.73% accuracy on the VQAv2 test-std set, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs.
Researcher Affiliation | Collaboration | Xiao Xu (1,2*), Chenfei Wu (2), Shachar Rosenman (3), Vasudev Lal (3), Wanxiang Che (1), Nan Duan (2); (1) Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; (2) Microsoft Research Asia; (3) Intel Labs, Cognitive Computing Research
Pseudocode | No | The paper describes the model architecture and training objectives in text and equations, but it does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
Open Datasets | Yes | We use four public image-caption datasets for pre-training: Conceptual Captions (CC) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), MSCOCO Captions (Chen et al. 2015), and Visual Genome (VG) (Krishna et al. 2017).
Dataset Splits | Yes | For VQAv2, we follow the common practice (Goyal et al. 2017; Teney et al. 2018): convert VQAv2 to a classification task with 3,129 answer classes; train the model with the training and validation data, and evaluate the model on the test-dev data. We use an image resolution of 384×384 for these downstream VL tasks, except for VQAv2, where we use 576×576 for a robust evaluation and fair comparison with METER. Standard settings and splits are used for all datasets. (A hedged sketch of this VQAv2 classification formulation follows the table.)
Hardware Specification | Yes | We pre-train BRIDGETOWER for 100k steps on 8 NVIDIA A100 GPUs with a batch size of 4,096.
Software Dependencies | No | The paper mentions using RoBERTa, CLIP-ViT-B/16, and the AdamW optimizer, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version).
Experiment Setup | Yes | BRIDGETOWER consists of a pre-trained textual encoder, RoBERTa-Base with 124M parameters, a pre-trained visual encoder, CLIP-ViT-B-224/16 with 86M parameters, and a randomly initialized 6-layer cross-modal encoder with 113M parameters. For each layer of the cross-modal encoder, the hidden size is set to 768, the intermediate size of the feed-forward networks is set to 3,072, and the number of heads is set to 12. The maximum length of the text sequence is set to 50. The patch size is 16×16. We use the AdamW (Loshchilov and Hutter 2019) optimizer with a base learning rate of 2e-5 and weight decay of 0.01. The learning rate is warmed up for 10% of the total training steps and then decayed linearly to zero for the rest of the training steps. Following METER, the learning rate of the cross-modal encoder is five times higher than that of the uni-modal encoders. We use an image resolution of 384×384 for these downstream VL tasks, except for VQAv2, where we use 576×576. We pre-train BRIDGETOWER for 100k steps on 8 NVIDIA A100 GPUs with a batch size of 4,096. The pre-training learning rate is set to 1e-5, and the image resolution in pre-training is set to 288×288. (These hyperparameters are summarized in a hedged configuration sketch after the table.)
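
To make the reported setup easier to scan, here is a minimal configuration sketch that collects the architecture, optimization, and pre-training values quoted in the rows above. The dataclass and its field names are illustrative assumptions for this page, not code from the official BridgeTower repository; only the numeric values and component names come from the paper.

```python
# Illustrative summary of the BridgeTower setup quoted in the table above.
# The dataclass and its field names are hypothetical; only the values and
# component names are taken from the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class BridgeTowerSetup:
    # Uni-modal encoders (initialized from pre-trained checkpoints)
    text_encoder: str = "RoBERTa-Base"           # 124M parameters
    visual_encoder: str = "CLIP-ViT-B-224/16"    # 86M parameters

    # Randomly initialized cross-modal encoder
    cross_modal_layers: int = 6                  # 113M parameters in total
    hidden_size: int = 768
    ffn_intermediate_size: int = 3072
    num_attention_heads: int = 12

    # Inputs
    max_text_length: int = 50
    patch_size: int = 16
    pretrain_image_resolution: int = 288         # 288x288 during pre-training

    # Optimization: AdamW with warm-up then linear decay
    base_learning_rate: float = 2e-5             # fine-tuning base LR
    pretrain_learning_rate: float = 1e-5
    weight_decay: float = 0.01
    warmup_ratio: float = 0.10
    cross_modal_lr_multiplier: float = 5.0       # cross-modal LR is 5x the uni-modal LR

    # Pre-training schedule, hardware, and data
    pretrain_steps: int = 100_000
    batch_size: int = 4096
    num_gpus: int = 8                            # NVIDIA A100
    pretrain_datasets: List[str] = field(default_factory=lambda: [
        "Conceptual Captions", "SBU Captions", "MSCOCO Captions", "Visual Genome",
    ])                                           # ~4M images in total


if __name__ == "__main__":
    print(BridgeTowerSetup())
```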
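
The VQAv2 fine-tuning described in the "Dataset Splits" row casts the task as classification over 3,129 answer classes. Below is a minimal PyTorch sketch of such a classifier head on top of a pooled cross-modal feature; the two-layer head shape is an assumption for illustration and may differ from the official implementation.

```python
# Minimal sketch of the VQAv2 classification formulation described in the
# "Dataset Splits" row: 3,129 answer classes predicted from a pooled
# cross-modal feature. The head architecture below is an assumption.
import torch
import torch.nn as nn

NUM_VQA_ANSWERS = 3129      # answer vocabulary size used by the common practice
HIDDEN_SIZE = 768           # cross-modal hidden size reported in the table
VQA_IMAGE_RESOLUTION = 576  # 576x576 for VQAv2; 384x384 for the other downstream tasks


class VQAClassifierHead(nn.Module):
    """Maps the pooled cross-modal representation to logits over answer classes."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE, num_answers: int = NUM_VQA_ANSWERS):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2),
            nn.LayerNorm(hidden_size * 2),
            nn.GELU(),
            nn.Linear(hidden_size * 2, num_answers),
        )

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled_features)


if __name__ == "__main__":
    head = VQAClassifierHead()
    fake_batch = torch.randn(4, HIDDEN_SIZE)  # stand-in for pooled cross-modal features
    print(head(fake_batch).shape)             # torch.Size([4, 3129])
```

Training then follows the practice quoted above: fine-tune on the VQAv2 train and validation splits and report accuracy on the held-out test-dev/test-std sets.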