ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
Authors: Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (pp. 3208-3216)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After pre-training on large scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%. To evaluate the performance of ERNIE-ViL, we conduct experiments on various vision-language tasks... |
| Researcher Affiliation | Industry | Fei Yu*, Jiji Tang*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang Baidu Inc., Beijing, China {yufei07, tangjiji, yinweichong, sunyu02, tianhao, wu_hua, wanghaifeng}@baidu.com |
| Pseudocode | No | The paper describes the steps and loss functions for its prediction tasks, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | And our code and pre-trained models are scheduled to be public. This promises future availability; no concrete link to the source code is provided in the paper. |
| Open Datasets | Yes | We use the Conceptual Captions (CC) dataset (Sharma et al. 2018) and SBU Captions (SBU) dataset (Ordonez, Kulkarni, and Berg 2011) as pre-training data. |
| Dataset Splits | Yes | Flickr30K (Young et al. 2014) contains 31,000 images and 5 captions for each image. Adopting the same split as ViLBERT (Lu et al. 2019), we use 1,000 images each for validation and testing, and the rest for training. |
| Hardware Specification | Yes | We train ERNIE-ViL on a total batch size of 512 for 700k steps on 8 V100 GPUs |
| Software Dependencies | No | The paper mentions 'PaddlePaddle' as the implementation framework, and various models/techniques like 'Faster R-CNN' and 'WordPiece', but no specific version numbers for these software dependencies are provided. |
| Experiment Setup | Yes | We train ERNIE-ViL on a total batch size of 512 for 700k steps... using Adam optimizer with initial learning rate of 1e-4... For the masking strategies, we randomly mask 15% of tokens, 30% of scene graph nodes, and 15% of image regions. For VCR: fine-tune the model over 6 epochs with a batch size of 64 and adopt Adam optimizer with initial learning rate of 1e-4. |
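The masking rates quoted in the Experiment Setup row (15% of tokens, 30% of scene graph nodes, 15% of image regions) can be illustrated with a minimal sketch. This is a hypothetical helper, not the authors' code; `mask_items` and `MASK_RATES` are names invented here, and `"[MASK]"` stands in for whatever mask symbol each modality actually uses.

```python
import random

# Hypothetical rates taken from the paper's quoted setup, not from released code.
MASK_RATES = {"token": 0.15, "scene_graph_node": 0.30, "image_region": 0.15}

def mask_items(items, kind, rng=random):
    """Return a copy of `items` with roughly MASK_RATES[kind] of entries
    independently replaced by the placeholder '[MASK]'."""
    rate = MASK_RATES[kind]
    return [("[MASK]" if rng.random() < rate else item) for item in items]

# Example: mask scene graph nodes at the 30% rate with a seeded generator.
rng = random.Random(0)
masked = mask_items(["dog", "brown", "on", "grass", "green"], "scene_graph_node", rng)
```

Each item is masked independently, so the realized fraction fluctuates around the target rate for short sequences; the 30% rate on scene graph nodes reflects the paper's emphasis on predicting objects, attributes, and relationships.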