VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Authors: Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved the first place of single model on the leaderboard of the VCR benchmark. |
| Researcher Affiliation | Collaboration | Weijie Su¹,², Xizhou Zhu¹,², Yue Cao², Bin Li¹, Lewei Lu², Furu Wei², Jifeng Dai² (¹University of Science and Technology of China; ²Microsoft Research Asia) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is released at https://github.com/jackroos/VL-BERT. |
| Open Datasets | Yes | We pre-train VL-BERT on both visual-linguistic and text-only datasets. Here we utilize the Conceptual Captions dataset (Sharma et al., 2018) as the visual-linguistic corpus. It contains around 3.3 million images annotated with captions... We utilize the Books Corpus (Zhu et al., 2015) and the English Wikipedia datasets, which are also utilized in pre-training BERT. |
| Dataset Splits | Yes | The released VCR dataset consists of 265k pairs of questions, answers, and rationales, over 100k unique movie scenes (100k images). They are split into training, validation, and test sets consisting of 213k questions and 80k images, 27k questions and 10k images, and 25k questions and 10k images, respectively. |
| Hardware Specification | Yes | Pre-training is conducted on 16 Tesla V100 GPUs for 250k iterations by SGD. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma & Ba, 2014)' but does not specify software versions for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | In SGD, the Adam optimizer (Kingma & Ba, 2014) is applied, with a base learning rate of 2×10⁻⁵, β1 = 0.9, β2 = 0.999, weight decay of 10⁻⁴, the learning rate warmed up over the first 8,000 steps, and linear decay of the learning rate. All the parameters in VL-BERT and Fast R-CNN are jointly trained in both the pre-training and fine-tuning phases. (See the sketch of this schedule below the table.) |
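
The optimizer settings and learning-rate schedule quoted above can be expressed as a minimal PyTorch-style sketch. This is not the authors' released implementation; `model`, `WARMUP_STEPS`, and `TOTAL_STEPS` are illustrative placeholders (the paper reports 8,000 warmup steps and 250k pre-training iterations), and the task-specific forward/backward pass is omitted.

```python
# Sketch (not the authors' code) of the reported optimization setup:
# Adam with base LR 2e-5, betas (0.9, 0.999), weight decay 1e-4,
# linear warmup over the first 8,000 steps, then linear decay of the LR.
import torch

WARMUP_STEPS = 8_000
TOTAL_STEPS = 250_000  # pre-training iterations reported in the paper

model = torch.nn.Linear(768, 768)  # stand-in for the jointly trained VL-BERT + Fast R-CNN parameters

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)

def warmup_then_linear_decay(step: int) -> float:
    """LR multiplier: ramps 0 -> 1 over WARMUP_STEPS, then decays linearly to 0."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay)

for step in range(TOTAL_STEPS):
    optimizer.zero_grad()
    # loss = model(...); loss.backward()  # task-specific forward/backward omitted
    optimizer.step()
    scheduler.step()
```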