OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Authors: Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
Researcher Affiliation | Collaboration | 1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 3 Microsoft Cloud + AI; 4 Microsoft Research Asia
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We will release the code after the paper is accepted.
Open Datasets | Yes | For the image-text data, we adopt the same pre-training dataset as [38, 37] with 14M images in total by default, including two human-annotated datasets (COCO [43] and Visual Genome [34]), and three web datasets (CC3M [54], CC12M [11], and SBU Captions [50]). For the video-text data, we use WebVid [6], which contains 2.5M videos from the web. The visual-label datasets that we adopt include the image dataset ImageNet-1K [16] and the video dataset Kinetics-400 [32].
Dataset Splits | Yes | We follow [77] to report the results on the validation sets in Table 7.
Hardware Specification | Yes | We run our experiments on a cluster of A100 GPUs.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., library or framework versions).
Experiment Setup | Yes | For the image-language pretraining stage, we initialize spatial attention with ViT-B/16 [21] pretrained on ImageNet-1K [16]. We take random image crops of resolution 224×224 as inputs and apply RandAugment [15]. The model is pretrained for 20 epochs using a batch size of 2880. For the joint pretraining, we sparsely sample 8 × 224 × 224 video clips, and train the model for 10 epochs with a batch size of 800 for video data and 2880 for image data. Our joint pretraining alternates batches between the image and video data. The model is optimized with AdamW [44] using a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (image) / 8e-5 (joint) and decayed linearly with a rate of 0.85.
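
The quoted setup boils down to an alternating-batch training loop. Below is a minimal, self-contained PyTorch sketch of that schedule: AdamW with weight decay 0.05, warmup to the joint-stage peak learning rate of 8e-5, and image/video batches alternated over 10 epochs. The toy model, dummy data, warmup length, and the per-epoch 0.85 decay interpretation are assumptions for illustration, not the authors' implementation.

```python
import torch
from torch import nn

# Stand-in for the OmniVL encoder/decoder stack (hypothetical; the real model is a
# unified image/video-language transformer).
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-5, weight_decay=0.05)

# Dummy loaders: in the paper, image batches hold 2880 image-text pairs and video
# batches hold 800 clips of 8 frames at 224x224; here each "batch" is random noise.
image_loader = [torch.randn(8, 16) for _ in range(20)]
video_loader = [torch.randn(8, 16) for _ in range(20)]

steps_per_epoch = len(image_loader)
warmup_steps = 100  # hypothetical; the warmup length is not stated in the paper

def lr_lambda(step):
    # Linear warmup to the peak LR, then shrink by 0.85 per epoch; this is one possible
    # reading of "warmed-up to ... and decayed linearly with a rate of 0.85".
    warmup = min(1.0, (step + 1) / warmup_steps)
    decay = 0.85 ** (step // steps_per_epoch)
    return warmup * decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(10):  # the joint stage runs for 10 epochs
    # Joint pretraining alternates batches between the image and video data.
    for image_batch, video_batch in zip(image_loader, video_loader):
        for batch in (image_batch, video_batch):
            loss = model(batch).pow(2).mean()  # placeholder for the vision-language losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
```

In the two-stage recipe, the image-only stage (20 epochs, peak LR 3e-4, batch size 2880) would use the same loop without the video loader.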