BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning
Authors: Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on different design choices for BRIDGETOWER and fine-tune it on various downstream VL tasks. Experimental results show that with only 4M images for pre-training, our model achieves state-of-the-art performance on various downstream VL tasks, especially 78.73% accuracy on the VQAv2 test-std set, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. |
| Researcher Affiliation | Collaboration | Xiao Xu (1,2), Chenfei Wu (2), Shachar Rosenman (3), Vasudev Lal (3), Wanxiang Che (1), Nan Duan (2); 1: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; 2: Microsoft Research Asia; 3: Intel Labs, Cognitive Computing Research |
| Pseudocode | No | The paper describes the model architecture and training objectives in text and equations, but it does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and checkpoints are available at https://github.com/microsoft/BridgeTower. |
| Open Datasets | Yes | We use four public image-caption datasets for pre-training: Conceptual Captions (CC) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), MSCOCO Captions (Chen et al. 2015), and Visual Genome (VG) (Krishna et al. 2017). |
| Dataset Splits | Yes | For VQAv2, we follow the common practice (Goyal et al. 2017; Teney et al. 2018): convert VQAv2 to a classification task with 3,129 answer classes; train the model with training data and validation data, and evaluate the model on the test-dev data. We use an image resolution of 384×384 for these downstream VL tasks, except for VQAv2, where we use 576×576 for a robust evaluation and fair comparison with METER. Standard settings and splits are used for all datasets. |
| Hardware Specification | Yes | We pre-train BRIDGETOWER for 100k steps on 8 NVIDIA A100 GPUs with a batch size of 4,096. |
| Software Dependencies | No | The paper mentions using RoBERTa, CLIP-ViT-B/16, and the AdamW optimizer, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | BRIDGETOWER consists of a pre-trained textual encoder, RoBERTa-BASE with 124M parameters, a pre-trained visual encoder, CLIP-ViT-B-224/16 with 86M parameters, and a randomly initialized 6-layer cross-modal encoder with 113M parameters. For each layer of the cross-modal encoder, the hidden size is set to 768, the intermediate size of feed-forward networks is set to 3,072, and the number of heads is set to 12. The maximum length of the text sequence is set to 50. The patch size is 16×16. We use the AdamW (Loshchilov and Hutter 2019) optimizer with a base learning rate of 2e-5 and weight decay of 0.01. The learning rate is warmed up for 10% of the total training steps and then decayed linearly to zero for the rest of the training steps. Following METER, the learning rate of the cross-modal encoder is five times higher than that of the uni-modal encoders. We use an image resolution of 384×384 for these downstream VL tasks, except for VQAv2, where we use 576×576. We pre-train BRIDGETOWER for 100k steps on 8 NVIDIA A100 GPUs with a batch size of 4,096. The pre-training learning rate is set to 1e-5 and the pre-training image resolution is set to 288×288. (A minimal optimizer and schedule sketch based on these settings follows the table.) |
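
As a reproducibility aid, the sketch below shows one way to wire up the optimizer and learning-rate schedule described in the Experiment Setup row: AdamW with weight decay 0.01, a base learning rate of 2e-5 for the uni-modal encoders, a 5x higher rate for the cross-modal encoder, linear warmup over the first 10% of steps, and linear decay to zero afterwards. This is a minimal sketch, not the authors' code: the module names (`text_encoder`, `visual_encoder`, `cross_modal_encoder`) and the placeholder `nn.Linear` layers are assumptions for illustration, and a faithful reproduction would substitute the actual RoBERTa-BASE, CLIP-ViT-B/16, and cross-modal encoder modules.

```python
# Minimal sketch (not the authors' code) of the optimizer and schedule
# reported in the Experiment Setup row, assuming PyTorch.
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-ins; a real reproduction would load the pre-trained
# RoBERTa-BASE, CLIP-ViT-B/16, and 6-layer cross-modal encoder here.
text_encoder = nn.Linear(768, 768)
visual_encoder = nn.Linear(768, 768)
cross_modal_encoder = nn.Linear(768, 768)

base_lr = 2e-5                 # downstream fine-tuning base rate (pre-training uses 1e-5)
cross_modal_lr = 5 * base_lr   # cross-modal encoder rate is 5x higher, following METER
weight_decay = 0.01
total_steps = 100_000          # 100k steps reported for pre-training
warmup_steps = int(0.1 * total_steps)  # warmup for 10% of total steps

optimizer = AdamW(
    [
        {"params": text_encoder.parameters(), "lr": base_lr},
        {"params": visual_encoder.parameters(), "lr": base_lr},
        {"params": cross_modal_encoder.parameters(), "lr": cross_modal_lr},
    ],
    lr=base_lr,
    weight_decay=weight_decay,
)

def linear_warmup_then_decay(step: int) -> float:
    """Multiplier on each group's base lr: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Call scheduler.step() after each optimizer.step() during training.
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)
```

With this structure, both recipes quoted above can be expressed by changing two numbers: the pre-training run would set `base_lr = 1e-5` with 100k total steps, while downstream fine-tuning keeps `base_lr = 2e-5` and sets `total_steps` to the length of the fine-tuning schedule.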