ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Authors: Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%." "To evaluate the performance of ERNIE-ViL, we conduct experiments on various vision-language tasks..."
Researcher Affiliation | Industry | Fei Yu*, Jiji Tang*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. Baidu Inc., Beijing, China. {yufei07, tangjiji, yinweichong, sunyu02, tianhao, wu_hua, wanghaifeng}@baidu.com
Pseudocode | No | The paper describes the steps and loss functions for its prediction tasks, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | "And our code and pre-trained models are scheduled to be public." This indicates future availability, not current concrete access to the source code.
Open Datasets | Yes | "We use the Conceptual Captions (CC) dataset (Sharma et al. 2018) and SBU Captions (SBU) dataset (Ordonez, Kulkarni, and Berg 2011) as pre-training data."
Dataset Splits | Yes | "Flickr30K (Young et al. 2014) contains 31,000 images and 5 captions for each image. Adopting the same split as ViLBERT (Lu et al. 2019), we use 1,000 images each for validation and testing and the rest for training."
Hardware Specification | Yes | "We train ERNIE-ViL on a total batch size of 512 for 700k steps on 8 V100 GPUs."
Software Dependencies | No | The paper mentions PaddlePaddle as the implementation framework, and various models/techniques such as Faster R-CNN and WordPieces, but no specific version numbers for these software dependencies are provided.
Experiment Setup | Yes | "We train ERNIE-ViL on a total batch size of 512 for 700k steps ... using Adam optimizer with initial learning rates of 1e-4 ... For the masking strategies, we randomly mask 15% of tokens, 30% of scene graph nodes, and 15% of image regions." For VCR, the model is fine-tuned over 6 epochs with a batch size of 64 using the Adam optimizer with an initial learning rate of 1e-4.
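
The quoted setup lends itself to a compact configuration sketch. The snippet below is a minimal, hypothetical Python rendering of the pre-training and VCR fine-tuning hyperparameters reported above; all names (PRETRAIN_CONFIG, VCR_FINETUNE_CONFIG, mask_indices) and the example region count of 36 are illustrative assumptions, since the paper's code was not public at review time.

    import random

    # Pre-training settings quoted from the paper (run on 8 V100 GPUs).
    PRETRAIN_CONFIG = {
        "batch_size": 512,            # total batch size across GPUs
        "train_steps": 700_000,
        "optimizer": "adam",
        "learning_rate": 1e-4,        # initial learning rate
        "mask_ratio_tokens": 0.15,    # randomly mask 15% of text tokens
        "mask_ratio_sg_nodes": 0.30,  # 30% of scene graph nodes
        "mask_ratio_regions": 0.15,   # 15% of image regions
    }

    # VCR fine-tuning settings quoted from the paper.
    VCR_FINETUNE_CONFIG = {
        "epochs": 6,
        "batch_size": 64,
        "optimizer": "adam",
        "learning_rate": 1e-4,
    }

    def mask_indices(num_items, ratio, rng=random):
        """Pick a random subset of positions to mask at the given ratio."""
        k = max(1, int(num_items * ratio))
        return set(rng.sample(range(num_items), k))

    # Example (illustrative region count): select which of 36 detected
    # image regions to mask at the 15% ratio.
    masked_regions = mask_indices(36, PRETRAIN_CONFIG["mask_ratio_regions"])

A helper like mask_indices only fixes the masking ratios; the paper's actual sampling and replacement rules for masked tokens, scene graph nodes, and regions are not specified in the quoted text.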