Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Authors: Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang

AAAI 2020, pp. 11336-11344

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two tasks and show the powerful ability of the crossmodal pre-training." From the Experiments section: "In this section, we describe how we pre-train our model and show the evaluation details on image-text retrieval task to which we transfer the pre-trained model." From the Ablation Studies section: "In this section, we perform ablation experiments in order to better understand the effect of the model size and the pretrain dataset size." (An illustrative sketch of the single additional output layer appears after the table.)
Researcher Affiliation | Collaboration | Gen Li (1), Nan Duan (2), Yuejian Fang (1), Ming Gong (3), Daxin Jiang (3); 1: School of Software & Microelectronics, Peking University, Beijing, China; 2: Natural Language Computing, Microsoft Research Asia, Beijing, China; 3: STCA NLP Group, Microsoft, Beijing, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a direct link to a source-code repository for the described methodology.
Open Datasets | Yes | From the "Pre-training Unicoder-VL" section: "Conceptual Captions dataset (Sharma et al. 2018) contains about 3.3M image and caption pairs harvested from the web, which are very suitable for our cross-modal pre-training." ... "SBU Captions (Ordonez, Kulkarni, and Berg 2011) dataset is also automatically collected from Web and contains 1M image-caption pairs."
Dataset Splits | Yes | "MSCOCO consists of 123,287 images, and each image contains roughly five textual descriptions. It is split into 82,783 training images, 5,000 validation images and 5,000 testing images." ... "Flickr30K contains 31,783 images collected from the Flickr website. Following (Karpathy and Fei-Fei 2015), we split the dataset into 29,783 training images, 1,000 validation images and 1,000 testing images." (These counts are collected into a small sketch after the table.)
Hardware Specification | Yes | "During Pre-training, our experiments are running on 4 NVIDIA Tesla V100 GPU."
Software Dependencies | No | The paper mentions the ADAM optimizer, Faster R-CNN, and BERT-base, but does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "Our model has 12 layers of Transformer blocks, where each block has 768 hidden units and 12 self-attention heads. The maximum sequence length is set as 144." ... "ADAM optimizer with learning rate of 1e-4 with a batch size of 192 with gradient accumulation (every 4 steps)." ... "We trained over 20 epochs with a batch size of 48 and initial learning rate of 3e-5." (See the training-loop sketch after the table.)
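
The "one additional output layer" quoted in the Research Type row can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch example, not the authors' code (which is not released): it assumes a pre-trained cross-modal encoder that returns one pooled joint representation per image-text pair, and stacks a single linear scoring layer on top for caption-based retrieval.

```python
import torch
import torch.nn as nn

class RetrievalHead(nn.Module):
    """Hypothetical sketch: one linear layer on top of a pre-trained
    cross-modal encoder, scoring whether an image-caption pair matches."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                   # pre-trained Unicoder-VL-style encoder (assumed interface)
        self.score = nn.Linear(hidden_size, 1)   # the single additional output layer

    def forward(self, text_tokens: torch.Tensor, image_regions: torch.Tensor) -> torch.Tensor:
        # Assumed: the encoder returns one pooled vector per image-text pair.
        pooled = self.encoder(text_tokens, image_regions)
        return self.score(pooled).squeeze(-1)    # one matching score per pair
```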
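
The counts quoted in the Dataset Splits row can be collected into a small lookup table. This is a minimal sketch assuming a plain-dictionary layout; only the numbers come from the quoted text (note that the listed MSCOCO train/val/test counts cover 92,783 of the 123,287 images, and the remainder is not itemized in the quote).

```python
# Train/val/test image counts quoted for the two retrieval benchmarks
# (Karpathy-style splits). The dictionary layout itself is illustrative.
RETRIEVAL_SPLITS = {
    "MSCOCO":    {"total": 123_287, "train": 82_783, "val": 5_000, "test": 5_000},
    "Flickr30K": {"total": 31_783,  "train": 29_783, "val": 1_000, "test": 1_000},
}

for name, counts in RETRIEVAL_SPLITS.items():
    listed = counts["train"] + counts["val"] + counts["test"]
    print(f"{name}: {listed} of {counts['total']} images itemized in the quote")
```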
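
The hyperparameters quoted in the Experiment Setup row translate naturally into a training-loop skeleton. The sketch below is illustrative and assumes plain PyTorch: only the quoted values (12 layers, 768 hidden units, 12 heads, maximum sequence length 144, Adam at 1e-4 with gradient accumulation every 4 steps for pre-training, and 3e-5 for fine-tuning) come from the paper; the dummy inputs and placeholder loss do not.

```python
import torch
import torch.nn as nn

# 12-layer Transformer encoder with 768 hidden units and 12 attention heads.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

MAX_SEQ_LEN = 144      # joint text + image-region sequence length (quoted)
ACCUM_STEPS = 4        # gradient accumulation every 4 steps (quoted)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)   # pre-training LR

for step in range(2 * ACCUM_STEPS):                      # a few dummy pre-training steps
    inputs = torch.randn(8, MAX_SEQ_LEN, 768)            # placeholder joint embeddings
    loss = encoder(inputs).pow(2).mean() / ACCUM_STEPS   # placeholder loss, scaled for accumulation
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()

# Fine-tuning on retrieval reuses the encoder with the smaller quoted learning rate.
finetune_optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-5)
```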