ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Authors: Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%." "To evaluate the performance of ERNIE-ViL, we conduct experiments on various vision-language tasks..."
Researcher Affiliation | Industry | Fei Yu*, Jiji Tang*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. Baidu Inc., Beijing, China. {yufei07, tangjiji, yinweichong, sunyu02, tianhao, wu_hua, wanghaifeng}@baidu.com
Pseudocode | No | The paper describes the steps and loss functions for its prediction tasks, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | "And our code and pre-trained models are scheduled to be public." This indicates future availability, not current concrete access to the source code.
Open Datasets | Yes | "We use the Conceptual Captions (CC) dataset (Sharma et al. 2018) and SBU Captions (SBU) dataset (Ordonez, Kulkarni, and Berg 2011) as pre-training data."
Dataset Splits | Yes | "Flickr30K (Young et al. 2014) contains 31,000 images and 5 captions for each image. Adopting the same split as ViLBERT (Lu et al. 2019), we use 1,000 images each for validation and testing and the rest for training."
Hardware Specification | Yes | "We train ERNIE-ViL on a total batch size of 512 for 700k steps on 8 V100 GPUs."
Software Dependencies | No | The paper mentions PaddlePaddle as the implementation framework, and various models/techniques such as Faster R-CNN and WordPieces, but no specific version numbers for these software dependencies are provided.
Experiment Setup | Yes | "We train ERNIE-ViL on a total batch size of 512 for 700k steps ... using Adam optimizer with initial learning rates of 1e-4 ... For the masking strategies, we randomly mask 15% of tokens, 30% of scene graph nodes, and 15% of image regions." For VCR, the model is fine-tuned over 6 epochs with a batch size of 64 using the Adam optimizer with an initial learning rate of 1e-4.
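
The quoted setup lends itself to a compact configuration sketch. The snippet below is a minimal, hypothetical Python rendering of the pre-training and VCR fine-tuning hyperparameters reported above; all names (PRETRAIN_CONFIG, VCR_FINETUNE_CONFIG, mask_indices) and the example region count of 36 are illustrative assumptions, since the paper's code was not public at review time.

    import random

    # Pre-training settings quoted from the paper (run on 8 V100 GPUs).
    PRETRAIN_CONFIG = {
        "batch_size": 512,            # total batch size across GPUs
        "train_steps": 700_000,
        "optimizer": "adam",
        "learning_rate": 1e-4,        # initial learning rate
        "mask_ratio_tokens": 0.15,    # randomly mask 15% of text tokens
        "mask_ratio_sg_nodes": 0.30,  # 30% of scene graph nodes
        "mask_ratio_regions": 0.15,   # 15% of image regions
    }

    # VCR fine-tuning settings quoted from the paper.
    VCR_FINETUNE_CONFIG = {
        "epochs": 6,
        "batch_size": 64,
        "optimizer": "adam",
        "learning_rate": 1e-4,
    }

    def mask_indices(num_items, ratio, rng=random):
        """Pick a random subset of positions to mask at the given ratio."""
        k = max(1, int(num_items * ratio))
        return set(rng.sample(range(num_items), k))

    # Example (illustrative region count): select which of 36 detected
    # image regions to mask at the 15% ratio.
    masked_regions = mask_indices(36, PRETRAIN_CONFIG["mask_ratio_regions"])

A helper like mask_indices only fixes the masking ratios; the paper's actual sampling and replacement rules for masked tokens, scene graph nodes, and regions are not specified in the quoted text.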