ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Authors: Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. ... We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks ... We observe significant improvements across tasks compared to existing task-specific models achieving state-of-the-art on all four tasks. ... Table 1 shows results across all transfer tasks and we highlight key findings below: Our architecture improves performance over a single-stream model. ... We also studied the impact of the size of the pretraining dataset. (see the co-attention sketch below this table) |
| Researcher Affiliation | Collaboration | Jiasen Lu (1), Dhruv Batra (1,3), Devi Parikh (1,3), Stefan Lee (1,2); (1) Georgia Institute of Technology, (2) Oregon State University, (3) Facebook AI Research |
| Pseudocode | No | The paper describes the model architecture and training tasks, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology. |
| Open Datasets | Yes | To train our full ViLBERT model, we apply the training tasks presented in Sec. 2.2 to the Conceptual Captions dataset [24]. ... We train and evaluate on the VQA 2.0 dataset [3] ... The Visual Commonsense Reasoning (VCR) dataset consists of 290k ... We train and evaluate on the RefCOCO+ dataset [32] ... We train and evaluate on the Flickr30k dataset [26]. |
| Dataset Splits | Yes | Flickr30k dataset [26] consisting of 31,000 images from Flickr with five captions each. Following the splits in [35], we use 1,000 images for validation and test each and train on the rest. (see the split sketch below this table) |
| Hardware Specification | Yes | We train on 8 Titan X GPUs with a total batch size of 512 for 10 epochs. |
| Software Dependencies | No | The paper mentions models like BERT_BASE, Faster R-CNN, and ResNet-101, but does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train on 8 Titan X GPUs with a total batch size of 512 for 10 epochs. We use the Adam optimizer with initial learning rates of 1e-4. We use a linear decay learning rate schedule with warm up to train the model. Both training task losses are weighed equally. ... For VQA: batch size of 256 over a maximum of 20 epochs. ... initial learning rate of 4e-5. ... For VCR: batch size of 64 and initial learning rate of 2e-5. ... For Grounding Referring Expressions: batch size of 256 and an initial learning rate of 4e-5. ... For Caption-Based Image Retrieval: batch size of 64 and an initial learning rate of 2e-5. (see the optimizer sketch below this table) |
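
The "Research Type" row quotes the paper's description of a two-stream model that exchanges information between the visual and linguistic streams. Below is a minimal sketch of a co-attentional transformer block in that spirit: each stream's queries attend to the other stream's keys and values. The class name `CoAttentionLayer`, the hidden sizes (1024 visual, 768 linguistic), the head count, and the residual/feed-forward layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    """One co-attention block: each stream's queries attend to the other stream."""

    def __init__(self, hidden_v=1024, hidden_t=768, num_heads=8):
        super().__init__()
        # Cross-attention in both directions; kdim/vdim let the two streams
        # keep different hidden sizes (an assumption mirroring the paper's setup).
        self.attn_v = nn.MultiheadAttention(hidden_v, num_heads, kdim=hidden_t,
                                            vdim=hidden_t, batch_first=True)
        self.attn_t = nn.MultiheadAttention(hidden_t, num_heads, kdim=hidden_v,
                                            vdim=hidden_v, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(hidden_v), nn.LayerNorm(hidden_v)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(hidden_t), nn.LayerNorm(hidden_t)
        self.ffn_v = nn.Sequential(nn.Linear(hidden_v, 4 * hidden_v), nn.GELU(),
                                   nn.Linear(4 * hidden_v, hidden_v))
        self.ffn_t = nn.Sequential(nn.Linear(hidden_t, 4 * hidden_t), nn.GELU(),
                                   nn.Linear(4 * hidden_t, hidden_t))

    def forward(self, v, t):
        # v: (batch, regions, hidden_v) image-region features from a detector
        # t: (batch, tokens, hidden_t) token features from the text stream
        v_att, _ = self.attn_v(query=v, key=t, value=t)   # vision attends to language
        t_att, _ = self.attn_t(query=t, key=v, value=v)   # language attends to vision
        v = self.norm_v1(v + v_att)
        t = self.norm_t1(t + t_att)
        v = self.norm_v2(v + self.ffn_v(v))
        t = self.norm_t2(t + self.ffn_t(t))
        return v, t


if __name__ == "__main__":
    layer = CoAttentionLayer()
    regions = torch.randn(2, 36, 1024)  # e.g. Faster R-CNN region features
    tokens = torch.randn(2, 20, 768)    # e.g. BERT token embeddings
    v_out, t_out = layer(regions, tokens)
    print(v_out.shape, t_out.shape)     # (2, 36, 1024) and (2, 20, 768)
```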
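
The "Dataset Splits" row quotes the Flickr30k protocol of [35]: 1,000 images each for validation and test, with the remaining 29,000 used for training. The snippet below only illustrates that arithmetic on placeholder image IDs; the shuffle seed and ID format are hypothetical, and the real split is fixed by the files accompanying [35].

```python
import random

# 31,000 placeholder IDs standing in for Flickr30k images.
image_ids = [f"img_{i:05d}" for i in range(31_000)]
rng = random.Random(0)      # hypothetical seed; the actual split comes from [35]
rng.shuffle(image_ids)

val_ids = image_ids[:1_000]
test_ids = image_ids[1_000:2_000]
train_ids = image_ids[2_000:]
print(len(train_ids), len(val_ids), len(test_ids))  # 29000 1000 1000
```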
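
The "Experiment Setup" row quotes the pretraining hyperparameters: Adam at an initial learning rate of 1e-4, a linear decay schedule with warmup, a total batch size of 512, and 10 epochs. The sketch below wires those numbers into a plain PyTorch `Adam` optimizer and a `LambdaLR` schedule; the warmup fraction, the dataset-size constant, and the loss names in the comments are assumptions not stated in the quoted text.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Numbers quoted in the row; dataset size and warmup fraction are assumptions.
BATCH_SIZE = 512            # total batch size across 8 GPUs (quoted)
EPOCHS = 10                 # quoted
BASE_LR = 1e-4              # quoted
NUM_PAIRS = 3_000_000       # rough Conceptual Captions scale, an assumption here
WARMUP_FRACTION = 0.1       # assumption; the quoted text only says "with warm up"

steps_per_epoch = NUM_PAIRS // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def linear_warmup_then_decay(step: int) -> float:
    """LR multiplier: ramp 0 -> 1 over warmup, then decay linearly back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


model = torch.nn.Linear(8, 8)  # stand-in for the full two-stream model
optimizer = Adam(model.parameters(), lr=BASE_LR)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Per the quoted setup, the two pretraining losses are weighed equally:
# loss = masked_multimodal_modelling_loss + multimodal_alignment_loss
# loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```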