Unifying Vision-and-Language Tasks via Text Generation

Authors: Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers.
Researcher Affiliation | Academia | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal (UNC Chapel Hill), {jmincho,jielei,haotan,mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | Our code is publicly available at: https://github.com/j-min/VL-T5
Open Datasets | Yes | We aggregate pretraining data from MS COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (VG; Krishna et al., 2016) images. The captioning data from these two datasets are used in the multimodal language modeling task. The COCO captions are also used in the image-text matching task to learn cross-modal alignment. Besides the captions, we also use three visual question answering datasets (VQA v2.0 (Goyal et al., 2019), GQA balanced version (Hudson & Manning, 2019), and Visual7W (Zhu et al., 2016)) as in Tan & Bansal (2019), but only use them for the visual question answering task.
Dataset Splits | Yes | We use the Karpathy split (Karpathy & Fei-Fei, 2015), which re-splits train2014 and val2014 images (Lin et al., 2014) into 113,287 / 5000 / 5000 for train / validation / test.
Hardware Specification | Yes | For both VL-T5 and VL-BART, it takes 4 days for 30-epoch pretraining with mixed precision training (Narang et al., 2018) on 4 RTX 2080 Ti GPUs.
Software Dependencies | Yes | Our code is based on PyTorch (Paszke et al., 2017) and Hugging Face Transformers (Wolf et al., 2019).
Experiment Setup | Yes | We use batch size 320 and 600 for VL-T5 and VL-BART, respectively. We use AdamW (Loshchilov & Hutter, 2019) with (β1, β2) = (0.9, 0.999) and learning rate 1e-4 with a 5% linear warmup schedule.
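
The Dataset Splits row quotes the Karpathy split counts (113,287 / 5000 / 5000 for train / validation / test). Below is a minimal sketch of how those counts could be checked, assuming the split is distributed in the commonly used "dataset_coco.json" format; the file name and schema are assumptions, not something specified in the paper.

```python
import json
from collections import Counter

# Assumption: the Karpathy split JSON stores a list under "images", where each
# image record has a "split" field in {"train", "restval", "val", "test"}.
with open("dataset_coco.json") as f:
    data = json.load(f)

counts = Counter(img["split"] for img in data["images"])

# "restval" images are conventionally folded into the training set, which is
# how the 113,287 / 5000 / 5000 train/val/test counts quoted above arise.
train = counts["train"] + counts["restval"]
print(f"train: {train}, val: {counts['val']}, test: {counts['test']}")
```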
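The Hardware Specification row mentions 30-epoch mixed-precision pretraining on 4 RTX 2080 Ti GPUs. The sketch below shows the generic PyTorch mixed-precision pattern (torch.cuda.amp) only; the tiny linear model and random batches are placeholders, not the authors' VL-T5/VL-BART training code.

```python
import torch
import torch.nn as nn

# Placeholder model and data standing in for VL-T5 / VL-BART and real batches.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(3):
    x = torch.randn(8, 768, device=device)
    optimizer.zero_grad()
    # autocast runs eligible ops in float16 on GPU; it is disabled on CPU here.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    # GradScaler rescales the loss to avoid float16 gradient underflow,
    # then unscales gradients before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```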
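The Software Dependencies row names PyTorch and Hugging Face Transformers. The sketch below only shows loading a T5 text backbone with Transformers; the "t5-base" checkpoint and the "vqa:" task prefix are assumptions consistent with the paper's base-size, prefix-based setup, and the visual embedding inputs that VL-T5 adds on top of T5 are omitted.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Load the plain text backbone; VL-T5 extends this with visual embeddings.
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Text-to-text inference with an illustrative task prefix (no visual features here).
inputs = tokenizer("vqa: what is in the picture?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```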
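The Experiment Setup row quotes AdamW with (β1, β2) = (0.9, 0.999), learning rate 1e-4, batch size 320 (VL-T5), and a 5% linear warmup; the Hardware row quotes 30 pretraining epochs. A minimal configuration sketch follows, using transformers' get_linear_schedule_with_warmup; the dataset size and the linear stand-in model are placeholders, not figures from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)          # placeholder for VL-T5 / VL-BART
batch_size, num_epochs = 320, 30           # values quoted in the table above
num_examples = 1_000_000                   # placeholder corpus size, not from the paper

steps_per_epoch = num_examples // batch_size
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.05 * total_steps)     # 5% linear warmup, as quoted above

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the training loop, scheduler.step() would be called after each optimizer.step().
```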