Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Authors: Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang

AAAI 2020, pp. 11336-11344

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two tasks and show the powerful ability of the crossmodal pre-training." From the Experiments section: "In this section, we describe how we pre-train our model and show the evaluation details on image-text retrieval task to which we transfer the pre-trained model." From the Ablation Studies section: "In this section, we perform ablation experiments in order to better understand the effect of the model size and the pretrain dataset size." (An illustrative sketch of the single additional output layer appears after the table.)
Researcher Affiliation | Collaboration | Gen Li (1), Nan Duan (2), Yuejian Fang (1), Ming Gong (3), Daxin Jiang (3); 1: School of Software & Microelectronics, Peking University, Beijing, China; 2: Natural Language Computing, Microsoft Research Asia, Beijing, China; 3: STCA NLP Group, Microsoft, Beijing, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a direct link to a source-code repository for the described methodology.
Open Datasets | Yes | From the "Pre-training Unicoder-VL" section: "Conceptual Captions dataset (Sharma et al. 2018) contains about 3.3M image and caption pairs harvested from the web, which are very suitable for our cross-modal pre-training." ... "SBU Captions (Ordonez, Kulkarni, and Berg 2011) dataset is also automatically collected from Web and contains 1M image-caption pairs."
Dataset Splits | Yes | "MSCOCO consists of 123,287 images, and each image contains roughly five textual descriptions. It is split into 82,783 training images, 5,000 validation images and 5,000 testing images." ... "Flickr30K contains 31,783 images collected from the Flickr website. Following (Karpathy and Fei-Fei 2015), we split the dataset into 29,783 training images, 1,000 validation images and 1,000 testing images." (These counts are collected into a small sketch after the table.)
Hardware Specification | Yes | "During Pre-training, our experiments are running on 4 NVIDIA Tesla V100 GPU."
Software Dependencies | No | The paper mentions the ADAM optimizer, Faster R-CNN, and BERT-base, but does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "Our model has 12 layers of Transformer blocks, where each block has 768 hidden units and 12 self-attention heads. The maximum sequence length is set as 144." ... "ADAM optimizer with learning rate of 1e-4 with a batch size of 192 with gradient accumulation (every 4 steps)." ... "We trained over 20 epochs with a batch size of 48 and initial learning rate of 3e-5." (See the training-loop sketch after the table.)
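
The "one additional output layer" quoted in the Research Type row can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch example, not the authors' code (which is not released): it assumes a pre-trained cross-modal encoder that returns one pooled joint representation per image-text pair, and stacks a single linear scoring layer on top for caption-based retrieval.

```python
import torch
import torch.nn as nn

class RetrievalHead(nn.Module):
    """Hypothetical sketch: one linear layer on top of a pre-trained
    cross-modal encoder, scoring whether an image-caption pair matches."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                   # pre-trained Unicoder-VL-style encoder (assumed interface)
        self.score = nn.Linear(hidden_size, 1)   # the single additional output layer

    def forward(self, text_tokens: torch.Tensor, image_regions: torch.Tensor) -> torch.Tensor:
        # Assumed: the encoder returns one pooled vector per image-text pair.
        pooled = self.encoder(text_tokens, image_regions)
        return self.score(pooled).squeeze(-1)    # one matching score per pair
```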
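
The counts quoted in the Dataset Splits row can be collected into a small lookup table. This is a minimal sketch assuming a plain-dictionary layout; only the numbers come from the quoted text (note that the listed MSCOCO train/val/test counts cover 92,783 of the 123,287 images, and the remainder is not itemized in the quote).

```python
# Train/val/test image counts quoted for the two retrieval benchmarks
# (Karpathy-style splits). The dictionary layout itself is illustrative.
RETRIEVAL_SPLITS = {
    "MSCOCO":    {"total": 123_287, "train": 82_783, "val": 5_000, "test": 5_000},
    "Flickr30K": {"total": 31_783,  "train": 29_783, "val": 1_000, "test": 1_000},
}

for name, counts in RETRIEVAL_SPLITS.items():
    listed = counts["train"] + counts["val"] + counts["test"]
    print(f"{name}: {listed} of {counts['total']} images itemized in the quote")
```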
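
The hyperparameters quoted in the Experiment Setup row translate naturally into a training-loop skeleton. The sketch below is illustrative and assumes plain PyTorch: only the quoted values (12 layers, 768 hidden units, 12 heads, maximum sequence length 144, Adam at 1e-4 with gradient accumulation every 4 steps for pre-training, and 3e-5 for fine-tuning) come from the paper; the dummy inputs and placeholder loss do not.

```python
import torch
import torch.nn as nn

# 12-layer Transformer encoder with 768 hidden units and 12 attention heads.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

MAX_SEQ_LEN = 144      # joint text + image-region sequence length (quoted)
ACCUM_STEPS = 4        # gradient accumulation every 4 steps (quoted)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)   # pre-training LR

for step in range(2 * ACCUM_STEPS):                      # a few dummy pre-training steps
    inputs = torch.randn(8, MAX_SEQ_LEN, 768)            # placeholder joint embeddings
    loss = encoder(inputs).pow(2).mean() / ACCUM_STEPS   # placeholder loss, scaled for accumulation
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()

# Fine-tuning on retrieval reuses the encoder with the smaller quoted learning rate.
finetune_optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-5)
```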