Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Authors: Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DAVINCI achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on different scales of pre-training datasets with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. |
| Researcher Affiliation | Collaboration | Shizhe Diao (The Hong Kong University of Science and Technology, sdiaoaa@connect.ust.hk); Wangchunshu Zhou (ByteDance AI Lab, wangchunshu.zhou@inf.ethz.ch); Xinsong Zhang (ByteDance AI Lab, zhangxinsong.0320@bytedance.com); Jiawei Wang (Shanghai Jiao Tong University, wjw_sjt@sjtu.edu.cn) |
| Pseudocode | No | The paper includes illustrations of the architecture (Figure 1) and descriptions of the pre-training procedures but does not contain a formal pseudocode or algorithm block. (An illustrative sketch of the described generative objectives is provided after this table.) |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/shizhediao/DaVinci. |
| Open Datasets | Yes | In-Domain Data (ID): COCO, Visual Genome (COCO images; 1.3M); Small-scale Web Data (SWD): SBU, CC-3M, CC-12M (Web; 14.9M); Object-Region Data (ORD): VG regions, VG objects, COCO objects, RefCOCO, Open Images, Obj365 (COCO, Flickr images; 17.0M); Vision Data (VD): ImageNet-21K (ImageNet; 13.2M); Large-scale Web Data (LWD): LAION-400M, DAVINCI-200M (Web; 601.3M); Text Data (TD): C4 (Web; 800GB) |
| Dataset Splits | Yes | We test our model's ability and versatility on five dimensions: language understanding on 8 GLUE tasks (Wang et al., 2019), vision understanding on ImageNet fine-tuning and 12 popular vision datasets for linear evaluation, multi-modal understanding on VQAv2 (Goyal et al., 2017b), SNLI-VE (Xie et al., 2019) and NLVR2 (Suhr et al., 2019), text-to-image generation on COCO (Chen et al., 2015), and image-to-text generation on COCO, NoCaps (Agrawal et al., 2019), and VLUE (Zhou et al., 2022b). Under the few-shot setting, we fine-tune a pre-trained model for 3 epochs on 1% of the training data (see the subsampling sketch after this table). |
| Hardware Specification | Yes | All pre-training experiments are conducted on 32GB NVIDIA V100 GPUs. The model trained on the largest data takes around 10 days on 1024 V100 GPUs. All fine-tuning experiments are conducted on 32GB NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer', 'WordPiece', 'ResNet-101', and 'VQGAN' but does not specify their version numbers or the versions of general software frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | Our model is a base-size Transformer implemented with a 6-layer encoder and a 6-layer decoder, 768 dimensions for hidden states, 512 for maximum input length, and 3072 for intermediate size. We train our model from scratch without initializing the Transformer encoder and decoder. However, the image encoder is initialized from ResNet-101 (He et al., 2016) with ImageNet weights... The learning rate is 2e-4 with a warm-up period for the first 2% of steps, and it is linearly decayed to 0 thereafter. In each batch, there are 8,192 image-text pairs... The default settings are shown in Table 6. We adopt dynamic masking in our experiments, where the masking ratio is randomly sampled from a uniform distribution U(0, 1). (An illustrative optimizer-schedule sketch follows this table.) |
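
Because the pre-training procedure is described only in prose (see the Pseudocode row), the following is a minimal, illustrative sketch of how generative "write" (text-suffix prediction) and "paint" (discrete image-token prediction) objectives of the kind the paper describes could be combined in one training step. All names here (`ToySeq2Seq`, `pretraining_step`, the dummy VQGAN codes) are placeholders introduced for illustration, not the authors' implementation.

```python
# Illustrative sketch only: combines a "write" loss (predict the masked text
# suffix) with a "paint" loss (predict discrete image tokens, e.g. VQGAN codes)
# for an encoder-decoder of the base size reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySeq2Seq(nn.Module):
    """Stand-in for the 6-layer encoder / 6-layer decoder, 768-d model."""

    def __init__(self, text_vocab=1000, image_vocab=1024, d_model=768):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.image_tok_emb = nn.Embedding(image_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=12,
            num_encoder_layers=6, num_decoder_layers=6,
            dim_feedforward=3072, batch_first=True)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_vocab)


def pretraining_step(model, image_feats, text_ids, image_token_ids):
    # Dynamic masking: the text prefix/suffix split ratio is drawn from U(0, 1),
    # matching the dynamic masking ratio reported in the paper.
    ratio = torch.rand(1).item()
    split = min(max(1, int(ratio * text_ids.size(1))), text_ids.size(1) - 2)
    prefix_ids, suffix_ids = text_ids[:, :split], text_ids[:, split:]

    # Multimodal prefix fed to the encoder: image features + text prefix.
    prefix = torch.cat([image_feats, model.text_emb(prefix_ids)], dim=1)

    def decode(target_emb):
        # Causal mask so teacher forcing cannot look ahead.
        mask = model.transformer.generate_square_subsequent_mask(target_emb.size(1))
        return model.transformer(prefix, target_emb, tgt_mask=mask)

    # "Write": teacher-forced prediction of the held-out text suffix.
    write_logits = model.text_head(decode(model.text_emb(suffix_ids[:, :-1])))
    loss_write = F.cross_entropy(write_logits.flatten(0, 1),
                                 suffix_ids[:, 1:].flatten())

    # "Paint": teacher-forced prediction of the discrete image tokens.
    paint_logits = model.image_head(decode(model.image_tok_emb(image_token_ids[:, :-1])))
    loss_paint = F.cross_entropy(paint_logits.flatten(0, 1),
                                 image_token_ids[:, 1:].flatten())

    return loss_write + loss_paint


if __name__ == "__main__":
    model = ToySeq2Seq()
    loss = pretraining_step(
        model,
        image_feats=torch.randn(2, 49, 768),               # dummy image features
        text_ids=torch.randint(0, 1000, (2, 32)),           # dummy caption tokens
        image_token_ids=torch.randint(0, 1024, (2, 64)))    # dummy VQGAN codes
    print(loss.item())
```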
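
For the few-shot evaluation noted in the Dataset Splits row (fine-tuning for 3 epochs on 1% of the training data), a minimal subsampling sketch might look as follows; the dataset and loader here are generic placeholders, not the paper's pipeline.

```python
# Minimal sketch of the few-shot setting: fine-tune on a random 1% subset of
# the training data for 3 epochs. Names and sizes are illustrative only.
import torch
from torch.utils.data import Subset, DataLoader, TensorDataset


def one_percent_subset(dataset, seed=0, fraction=0.01):
    """Return a reproducible random subset containing `fraction` of the data."""
    g = torch.Generator().manual_seed(seed)
    n = max(1, int(fraction * len(dataset)))
    indices = torch.randperm(len(dataset), generator=g)[:n].tolist()
    return Subset(dataset, indices)


if __name__ == "__main__":
    # Dummy stand-in for a downstream training set.
    full_train = TensorDataset(torch.randn(10_000, 8), torch.randint(0, 2, (10_000,)))
    few_shot_train = one_percent_subset(full_train)
    loader = DataLoader(few_shot_train, batch_size=32, shuffle=True)
    for epoch in range(3):           # 3 fine-tuning epochs, as reported
        for batch in loader:
            pass                     # forward/backward pass would go here
    print(len(few_shot_train))       # 100 examples = 1% of 10,000
```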
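
The Experiment Setup row reports AdamW with a peak learning rate of 2e-4, a linear warm-up over the first 2% of steps, and a linear decay to 0 afterwards. A minimal sketch of that schedule is below; the function name and total step count are assumptions made for illustration, not values from the paper.

```python
# Illustrative optimizer/schedule sketch matching the reported hyperparameters:
# AdamW, peak lr 2e-4, linear warm-up over the first 2% of steps, linear decay to 0.
import torch
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_schedule(model, total_steps, peak_lr=2e-4, warmup_frac=0.02):
    warmup_steps = max(1, int(warmup_frac * total_steps))
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                         # linear warm-up
        remaining = total_steps - warmup_steps
        return max(0.0, (total_steps - step) / remaining)      # linear decay to 0

    return optimizer, LambdaLR(optimizer, lr_lambda)


if __name__ == "__main__":
    model = torch.nn.Linear(768, 768)  # stand-in module
    optimizer, scheduler = build_optimizer_and_schedule(model, total_steps=100_000)
    for _ in range(5):
        optimizer.step()
        scheduler.step()
        print(scheduler.get_last_lr())
```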