OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Authors: Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale. |
| Researcher Affiliation | Collaboration | Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Microsoft Cloud + AI; Microsoft Research Asia |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code after the paper is accepted |
| Open Datasets | Yes | For the image-text data, we adopt the same pre-training dataset as [38, 37] with 14M images in total by default, including two human-annotated datasets (COCO [43] and Visual Genome [34]), and three web datasets (CC3M [54], CC12M [11], and SBU Captions [50]). For the video-text data, we use WebVid [6] which contains 2.5M videos from the web. The visual-label datasets that we adopt include the image dataset ImageNet-1K [16] and video dataset Kinetics-400 [32]. |
| Dataset Splits | Yes | We follow [77] to report the results on the validation sets in Table 7. |
| Hardware Specification | Yes | We run our experiments on a cluster of A100 GPUs. |
| Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., library or framework versions). |
| Experiment Setup | Yes | For the image-language pretraining stage, we initialize spatial attention with ViT-B/16 [21] pretrained on ImageNet-1K [16]. We take random image crops of resolution 224×224 as inputs and apply RandAugment [15]. The model is pretrained for 20 epochs using a batch size of 2880. For the joint pretraining, we sparsely sample 8 × 224 × 224 video clips, and train the model for 10 epochs with a batch size of 800 for video data and 2880 for image data. Our joint pretraining alternates batches between the image and video data. The model is optimized with AdamW [44] using a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (image) / 8e-5 (joint) and decayed linearly with a rate of 0.85. |
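To make the reported setup in the row above concrete, here is a minimal PyTorch sketch of the joint-pretraining loop: AdamW with weight decay 0.05, a base learning rate of 8e-5, per-epoch decay with a rate of 0.85, and alternating image/video batches. The model, `fake_loader`, loss, and number of batches are placeholders, the warmup phase is omitted for brevity, and reading "decayed linearly with a rate of 0.85" as a per-epoch multiplicative factor is an assumption; this is an illustrative sketch, not the authors' implementation.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import ExponentialLR

# Tiny stand-in model; the actual OmniVL architecture is not reproduced here.
model = nn.Linear(768, 768)

def fake_loader(num_batches, batch_size):
    # Placeholder for the real image-text / video-text data loaders.
    for _ in range(num_batches):
        yield torch.randn(batch_size, 768)

# Hyperparameters as reported for the joint pretraining stage.
epochs = 10
image_batch_size = 2880   # batch size for image data
video_batch_size = 800    # batch size for video data
base_lr = 8e-5            # 3e-4 is reported for the image-language stage

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
# Assumption: the 0.85 decay is applied once per epoch as a multiplicative factor.
scheduler = ExponentialLR(optimizer, gamma=0.85)

for epoch in range(epochs):
    image_batches = fake_loader(4, image_batch_size)
    video_batches = fake_loader(4, video_batch_size)
    # Joint pretraining alternates batches between image and video data.
    for img_batch, vid_batch in zip(image_batches, video_batches):
        for batch in (img_batch, vid_batch):
            loss = model(batch).pow(2).mean()  # placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    scheduler.step()
```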