ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Authors: Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. ... We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks ... We observe significant improvements across tasks compared to existing task-specific models achieving state-of-the-art on all four tasks. ... Table 1 shows results across all transfer tasks and we highlight key findings below: Our architecture improves performance over a single-stream model. ... We also studied the impact of the size of the pretraining dataset. (see the co-attention sketch below this table) |
| Researcher Affiliation | Collaboration | Jiasen Lu (1), Dhruv Batra (1,3), Devi Parikh (1,3), Stefan Lee (1,2); (1) Georgia Institute of Technology, (2) Oregon State University, (3) Facebook AI Research |
| Pseudocode | No | The paper describes the model architecture and training tasks, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology. |
| Open Datasets | Yes | To train our full ViLBERT model, we apply the training tasks presented in Sec. 2.2 to the Conceptual Captions dataset [24]. ... We train and evaluate on the VQA 2.0 dataset [3] ... The Visual Commonsense Reasoning (VCR) dataset consists of 290k ... We train and evaluate on the RefCOCO+ dataset [32] ... We train and evaluate on the Flickr30k dataset [26]. |
| Dataset Splits | Yes | Flickr30k dataset [26] consisting of 31,000 images from Flickr with five captions each. Following the splits in [35], we use 1,000 images for validation and test each and train on the rest. (see the split sketch below this table) |
| Hardware Specification | Yes | We train on 8 Titan X GPUs with a total batch size of 512 for 10 epochs. |
| Software Dependencies | No | The paper mentions models like BERT_BASE, Faster R-CNN, and ResNet-101, but does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train on 8 Titan X GPUs with a total batch size of 512 for 10 epochs. We use the Adam optimizer with initial learning rates of 1e-4. We use a linear decay learning rate schedule with warm up to train the model. Both training task losses are weighed equally. ... For VQA: batch size of 256 over a maximum of 20 epochs. ... initial learning rate of 4e-5. ... For VCR: batch size of 64 and initial learning rate of 2e-5. ... For Grounding Referring Expressions: batch size of 256 and an initial learning rate of 4e-5. ... For Caption-Based Image Retrieval: batch size of 64 and an initial learning rate of 2e-5. (see the optimizer sketch below this table) |
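
The "Research Type" row quotes the paper's description of a two-stream model that exchanges information between the visual and linguistic streams. Below is a minimal sketch of a co-attentional transformer block in that spirit: each stream's queries attend to the other stream's keys and values. The class name `CoAttentionLayer`, the hidden sizes (1024 visual, 768 linguistic), the head count, and the residual/feed-forward layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    """One co-attention block: each stream's queries attend to the other stream."""

    def __init__(self, hidden_v=1024, hidden_t=768, num_heads=8):
        super().__init__()
        # Cross-attention in both directions; kdim/vdim let the two streams
        # keep different hidden sizes (an assumption mirroring the paper's setup).
        self.attn_v = nn.MultiheadAttention(hidden_v, num_heads, kdim=hidden_t,
                                            vdim=hidden_t, batch_first=True)
        self.attn_t = nn.MultiheadAttention(hidden_t, num_heads, kdim=hidden_v,
                                            vdim=hidden_v, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(hidden_v), nn.LayerNorm(hidden_v)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(hidden_t), nn.LayerNorm(hidden_t)
        self.ffn_v = nn.Sequential(nn.Linear(hidden_v, 4 * hidden_v), nn.GELU(),
                                   nn.Linear(4 * hidden_v, hidden_v))
        self.ffn_t = nn.Sequential(nn.Linear(hidden_t, 4 * hidden_t), nn.GELU(),
                                   nn.Linear(4 * hidden_t, hidden_t))

    def forward(self, v, t):
        # v: (batch, regions, hidden_v) image-region features from a detector
        # t: (batch, tokens, hidden_t) token features from the text stream
        v_att, _ = self.attn_v(query=v, key=t, value=t)   # vision attends to language
        t_att, _ = self.attn_t(query=t, key=v, value=v)   # language attends to vision
        v = self.norm_v1(v + v_att)
        t = self.norm_t1(t + t_att)
        v = self.norm_v2(v + self.ffn_v(v))
        t = self.norm_t2(t + self.ffn_t(t))
        return v, t


if __name__ == "__main__":
    layer = CoAttentionLayer()
    regions = torch.randn(2, 36, 1024)  # e.g. Faster R-CNN region features
    tokens = torch.randn(2, 20, 768)    # e.g. BERT token embeddings
    v_out, t_out = layer(regions, tokens)
    print(v_out.shape, t_out.shape)     # (2, 36, 1024) and (2, 20, 768)
```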
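
The "Dataset Splits" row quotes the Flickr30k protocol of [35]: 1,000 images each for validation and test, with the remaining 29,000 used for training. The snippet below only illustrates that arithmetic on placeholder image IDs; the shuffle seed and ID format are hypothetical, and the real split is fixed by the files accompanying [35].

```python
import random

# 31,000 placeholder IDs standing in for Flickr30k images.
image_ids = [f"img_{i:05d}" for i in range(31_000)]
rng = random.Random(0)      # hypothetical seed; the actual split comes from [35]
rng.shuffle(image_ids)

val_ids = image_ids[:1_000]
test_ids = image_ids[1_000:2_000]
train_ids = image_ids[2_000:]
print(len(train_ids), len(val_ids), len(test_ids))  # 29000 1000 1000
```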
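
The "Experiment Setup" row quotes the pretraining hyperparameters: Adam at an initial learning rate of 1e-4, a linear decay schedule with warmup, a total batch size of 512, and 10 epochs. The sketch below wires those numbers into a plain PyTorch `Adam` optimizer and a `LambdaLR` schedule; the warmup fraction, the dataset-size constant, and the loss names in the comments are assumptions not stated in the quoted text.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Numbers quoted in the row; dataset size and warmup fraction are assumptions.
BATCH_SIZE = 512            # total batch size across 8 GPUs (quoted)
EPOCHS = 10                 # quoted
BASE_LR = 1e-4              # quoted
NUM_PAIRS = 3_000_000       # rough Conceptual Captions scale, an assumption here
WARMUP_FRACTION = 0.1       # assumption; the quoted text only says "with warm up"

steps_per_epoch = NUM_PAIRS // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def linear_warmup_then_decay(step: int) -> float:
    """LR multiplier: ramp 0 -> 1 over warmup, then decay linearly back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


model = torch.nn.Linear(8, 8)  # stand-in for the full two-stream model
optimizer = Adam(model.parameters(), lr=BASE_LR)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Per the quoted setup, the two pretraining losses are weighed equally:
# loss = masked_multimodal_modelling_loss + multimodal_alignment_loss
# loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```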