Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Authors: Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. The result shows that our approach not only outperforms the state-of-the-art VLP models, but also exhibits superiority on the IMF metric."
Researcher Affiliation | Collaboration | Hongwei Xue1, Yupan Huang2, Bei Liu3, Houwen Peng3, Jianlong Fu3, Houqiang Li1, Jiebo Luo4. 1University of Science and Technology of China, Hefei, China; 2Sun Yat-sen University, Guangzhou, China; 3Microsoft Research, Beijing, China; 4University of Rochester, Rochester, NY.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | In the 'Questions for Paper Analysis' section, the authors state: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]'
Open Datasets | Yes | "We follow the dataset settings of SOHO [21] for pre-training. We focus on in-domain datasets: MSCOCO Captions (MSCOCO) [33] and Visual Genome Dense Captions (VG) [29]... We present the experimental results on VQA v2.0 dataset in Table 2... Experiment results on Flickr30k [38] are shown in Table 1... For this task, we evaluate our model on NLVR2 dataset [41]... For this task, we evaluate our model on SNLI-VE dataset [45]."
Dataset Splits | Yes | "Table 6: Statistics of different tasks." The table lists, per task, the dataset, train split, test split, and metric; e.g. Visual Question Answering uses VQA2.0 [18] with the train+val split for training and test-dev/test-std for testing, and Visual Reasoning uses NLVR2 [41] with the train split for training and dev/test-P for testing.
Hardware Specification | Yes | "Our model is pre-trained on 8 NVIDIA Tesla V100 GPUs with a batch size of 2048."
Software Dependencies | No | The paper mentions software components like 'Word Piece tokenizer', 'AdamW optimizers', 'Swin Transformer', and 'BERT' but does not specify version numbers for these or other software dependencies, such as programming languages or deep learning frameworks.
Experiment Setup | Yes | "We use AdamW optimizers for vision Transformer with learning rate 5e-6 and multi-modal Transformer with learning rate 5e-5 empirically. Empirically, we set the k in Equation 9 as 7. We set the λ1 and λ2 in Equation 10 as 1 and 0.01 respectively. Our model is pre-trained on 8 NVIDIA Tesla V100 GPUs with a batch size of 2048. Following SOHO [21], the learning rate is warmed up for the first 500 iterations. The training process takes 40 epochs until convergence and the learning rate decays by 10 times at the 25th and 35th epochs."
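
For readers who want to approximate the quoted training configuration, the sketch below shows one way to set it up: AdamW with separate learning rates for the vision Transformer (5e-6) and the multi-modal Transformer (5e-5), a 500-iteration warmup, and a 10x step decay at the 25th and 35th of 40 epochs. This is a minimal sketch assuming a PyTorch implementation; the module names vision_encoder and multimodal_encoder are placeholders, not identifiers from the paper, whose code is not released, and the λ1/λ2 loss weighting from Equation 10 is not shown.

# Minimal sketch (assumption: a PyTorch implementation; not the authors' code).
# Implements the quoted schedule: AdamW, lr 5e-6 for the vision Transformer and
# 5e-5 for the multi-modal Transformer, 500-iteration warmup, and a 10x decay
# at the 25th and 35th of 40 epochs. `vision_encoder` and `multimodal_encoder`
# are placeholder module names.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(vision_encoder, multimodal_encoder, iters_per_epoch):
    optimizer = AdamW([
        {"params": vision_encoder.parameters(), "lr": 5e-6},
        {"params": multimodal_encoder.parameters(), "lr": 5e-5},
    ])

    warmup_iters = 500
    decay_epochs = (25, 35)  # learning rate divided by 10 at each of these epochs

    def lr_lambda(step):
        if step < warmup_iters:
            # Linear warmup over the first 500 iterations.
            return float(step + 1) / warmup_iters
        # Step decay: multiply by 0.1 once the 25th epoch is reached, again at the 35th.
        epoch = step // iters_per_epoch
        return 0.1 ** sum(epoch >= e for e in decay_epochs)

    scheduler = LambdaLR(optimizer, lr_lambda)  # call scheduler.step() once per iteration
    return optimizer, scheduler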