Unifying Vision-and-Language Tasks via Text Generation

Authors: Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers.
Researcher Affiliation | Academia | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal (UNC Chapel Hill), {jmincho,jielei,haotan,mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | Our code is publicly available at: https://github.com/j-min/VL-T5
Open Datasets | Yes | We aggregate pretraining data from MS COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (VG; Krishna et al., 2016) images. The captioning data from these two datasets are used in the multimodal language modeling task. The COCO captions are also used in the image-text matching task to learn cross-modal alignment. Besides the captions, we also use three visual question answering datasets (VQA v2.0 (Goyal et al., 2019), GQA balanced version (Hudson & Manning, 2019), and Visual7W (Zhu et al., 2016)) as in Tan & Bansal (2019), but only use them for the visual question answering task.
Dataset Splits | Yes | We use the Karpathy split (Karpathy & Fei-Fei, 2015), which re-splits train2014 and val2014 images (Lin et al., 2014) into 113,287 / 5000 / 5000 for train / validation / test.
Hardware Specification | Yes | For both VL-T5 and VL-BART, it takes 4 days for 30-epoch pretraining with mixed precision training (Narang et al., 2018) on 4 RTX 2080 Ti GPUs.
Software Dependencies | Yes | Our code is based on PyTorch (Paszke et al., 2017) and Hugging Face Transformers (Wolf et al., 2019).
Experiment Setup | Yes | We use batch size 320 and 600 for VL-T5 and VL-BART, respectively. We use AdamW (Loshchilov & Hutter, 2019) with (β1, β2) = (0.9, 0.999) and learning rate 1e-4 with a 5% linear warmup schedule.
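
The Dataset Splits row quotes the Karpathy split counts (113,287 / 5000 / 5000 for train / validation / test). Below is a minimal sketch of how those counts could be checked, assuming the split is distributed in the commonly used "dataset_coco.json" format; the file name and schema are assumptions, not something specified in the paper.

```python
import json
from collections import Counter

# Assumption: the Karpathy split JSON stores a list under "images", where each
# image record has a "split" field in {"train", "restval", "val", "test"}.
with open("dataset_coco.json") as f:
    data = json.load(f)

counts = Counter(img["split"] for img in data["images"])

# "restval" images are conventionally folded into the training set, which is
# how the 113,287 / 5000 / 5000 train/val/test counts quoted above arise.
train = counts["train"] + counts["restval"]
print(f"train: {train}, val: {counts['val']}, test: {counts['test']}")
```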
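The Hardware Specification row mentions 30-epoch mixed-precision pretraining on 4 RTX 2080 Ti GPUs. The sketch below shows the generic PyTorch mixed-precision pattern (torch.cuda.amp) only; the tiny linear model and random batches are placeholders, not the authors' VL-T5/VL-BART training code.

```python
import torch
import torch.nn as nn

# Placeholder model and data standing in for VL-T5 / VL-BART and real batches.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(3):
    x = torch.randn(8, 768, device=device)
    optimizer.zero_grad()
    # autocast runs eligible ops in float16 on GPU; it is disabled on CPU here.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    # GradScaler rescales the loss to avoid float16 gradient underflow,
    # then unscales gradients before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```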
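The Software Dependencies row names PyTorch and Hugging Face Transformers. The sketch below only shows loading a T5 text backbone with Transformers; the "t5-base" checkpoint and the "vqa:" task prefix are assumptions consistent with the paper's base-size, prefix-based setup, and the visual embedding inputs that VL-T5 adds on top of T5 are omitted.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Load the plain text backbone; VL-T5 extends this with visual embeddings.
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Text-to-text inference with an illustrative task prefix (no visual features here).
inputs = tokenizer("vqa: what is in the picture?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```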
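The Experiment Setup row quotes AdamW with (β1, β2) = (0.9, 0.999), learning rate 1e-4, batch size 320 (VL-T5), and a 5% linear warmup; the Hardware row quotes 30 pretraining epochs. A minimal configuration sketch follows, using transformers' get_linear_schedule_with_warmup; the dataset size and the linear stand-in model are placeholders, not figures from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)          # placeholder for VL-T5 / VL-BART
batch_size, num_epochs = 320, 30           # values quoted in the table above
num_examples = 1_000_000                   # placeholder corpus size, not from the paper

steps_per_epoch = num_examples // batch_size
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.05 * total_steps)     # 5% linear warmup, as quoted above

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the training loop, scheduler.step() would be called after each optimizer.step().
```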