Unifying Vision-and-Language Tasks via Text Generation
Authors: Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. |
| Researcher Affiliation | Academia | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal (UNC Chapel Hill); {jmincho,jielei,haotan,mbansal}@cs.unc.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/j-min/VL-T5 |
| Open Datasets | Yes | We aggregate pretraining data from MS COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (VG; Krishna et al. (2016)) images. The captioning data from these two datasets are used in the multimodal language modeling task. The COCO captions are also used in the image-text matching task to learn cross-modal alignment. Besides the captions, we also use three visual question answering datasets (VQA v2.0 (Goyal et al., 2019), GQA balanced version (Hudson & Manning, 2019), and Visual7W (Zhu et al., 2016)) as in Tan & Bansal (2019), but only used them for the visual question answering task. (A compact restatement of this aggregation appears after the table.) |
| Dataset Splits | Yes | We use Karpathy split (Karpathy & Fei-Fei, 2015), which re-splits train2014 and val2014 images (Lin et al., 2014) into 113,287 / 5000 / 5000 for train / validation / test. (A split-loading sketch follows the table.) |
| Hardware Specification | Yes | For both VL-T5 and VL-BART, it takes 4 days for 30-epoch pretraining with mixed precision training (Narang et al., 2018) on 4 RTX 2080 Ti GPUs. |
| Software Dependencies | Yes | Our code is based on PyTorch (Paszke et al., 2017) and Huggingface Transformers (Wolf et al., 2019). |
| Experiment Setup | Yes | We use batch size 320 and 600 for VL-T5 and VL-BART, respectively. We use AdamW (Loshchilov & Hutter, 2019) with (β1, β2) = (0.9, 0.999) and learning rate 1e-4 with 5% linear warmup schedule. (A minimal training-setup sketch follows the table.) |
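As a compact restatement of the aggregation quoted in the open-datasets row, the mapping below lists which source feeds which pretraining task. The dictionary layout and key names are illustrative assumptions, not the configuration format used in the VL-T5 repository.

```python
# Illustrative summary only; names and structure are not taken from the VL-T5 codebase.
PRETRAINING_DATA = {
    "multimodal_language_modeling": ["COCO captions", "Visual Genome captions"],
    "image_text_matching": ["COCO captions"],
    "visual_question_answering": ["VQA v2.0", "GQA (balanced)", "Visual7W"],
}

for task, sources in PRETRAINING_DATA.items():
    print(f"{task}: {', '.join(sources)}")
```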
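The Karpathy split quoted in the dataset-splits row is commonly distributed as a single JSON file that assigns each COCO image to a split. The sketch below shows one way to recover the 113,287 / 5,000 / 5,000 counts from it; the filename `dataset_coco.json`, the field names, and the convention of folding the `restval` portion into training are assumptions about the standard release, not details taken from the paper.

```python
import json
from collections import Counter

# Assumed filename of the standard Karpathy split release; adjust to your local copy.
KARPATHY_SPLIT_FILE = "dataset_coco.json"

def load_karpathy_split(path=KARPATHY_SPLIT_FILE):
    """Group COCO images into train/validation/test following the Karpathy split.

    Assumption: the 'restval' images (val2014 images not used for val/test) are
    folded into the training set, which yields the 113,287-image train split.
    """
    with open(path) as f:
        data = json.load(f)

    splits = {"train": [], "val": [], "test": []}
    for img in data["images"]:
        split = img["split"]
        if split == "restval":  # merge restval into train (assumption, see docstring)
            split = "train"
        splits[split].append(img["filename"])
    return splits

if __name__ == "__main__":
    splits = load_karpathy_split()
    print(Counter({name: len(images) for name, images in splits.items()}))
    # Expected counts: train 113,287 / val 5,000 / test 5,000
```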
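The hyperparameters quoted in the hardware, software, and setup rows (AdamW with betas (0.9, 0.999), learning rate 1e-4, 5% linear warmup, 30-epoch mixed-precision training) can be assembled from stock PyTorch and Huggingface Transformers utilities. The sketch below is a minimal illustration under those assumptions; the `t5-base` backbone, the steps-per-epoch value, and the loop structure are placeholders rather than the authors' implementation, which lives in the linked repository.

```python
import torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup

# Placeholder backbone; VL-T5 extends a pretrained T5, but this is not the authors' model class.
model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda()

# Hyperparameters quoted in the report: lr 1e-4, betas (0.9, 0.999), 30 epochs.
EPOCHS, STEPS_PER_EPOCH = 30, 1000            # steps-per-epoch is a placeholder
total_steps = EPOCHS * STEPS_PER_EPOCH
optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),  # 5% linear warmup
    num_training_steps=total_steps,
)

# Mixed-precision step; torch.cuda.amp is one common way to realize it, the paper
# only states that mixed-precision training was used.
scaler = torch.cuda.amp.GradScaler()

def training_step(batch):
    """One optimization step; `batch` is a placeholder dict with input_ids and labels."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```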