InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We learn a new video-language model, ViCLIP, which is trained on InternVid using ViT-L. It incorporates both contrastive learning and masked modeling, allowing for efficient learning of transferable video-language representations. This model achieves state-of-the-art zero-shot action recognition on Kinetics, scoring 75.7, 73.5, and 66.4 on K400, K600, and K700, respectively, averaging top-1 and top-5 accuracies. It attains competitive performance on video retrieval, setting a new baseline for video-text understanding.
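The response above describes a contrastive video-text objective. As a minimal sketch (not the authors' implementation: batch shapes, embedding dimension, and the temperature value are illustrative assumptions), the symmetric InfoNCE loss underlying such training can be written as:

```python
import numpy as np

def infonce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    Row i of video_emb and row i of text_emb are assumed to be a matching
    pair; all other rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product equals cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])   # matching pairs lie on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # average the video->text and text->video directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
matched = infonce_loss(video, video)                      # perfectly aligned pairs
random = infonce_loss(video, rng.normal(size=(4, 8)))     # unrelated text
```

With perfectly aligned pairs the loss approaches zero, while unrelated embeddings yield a loss near log(batch size); the masked-modeling term the response mentions would be a separate reconstruction objective added on top.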
Researcher Affiliation | Academia | 1OpenGVLab, Shanghai AI Laboratory; 2Nanjing University; 3Monash University; 4The University of Hong Kong; 5Nanyang Technological University; 6Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Pseudocode | Yes | Here is a pseudocode example of this process:
Open Source Code | Yes | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid
Open Datasets | Yes | This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation.
Dataset Splits | Yes | We learn ViCLIP on subsets of InternVid and evaluate its performance on video-related benchmarks under full-finetuned and zero-shot settings.
Hardware Specification | Yes | ViCLIP is trained with 64 NVIDIA A100 GPUs for 3 days on 50M video-text pairs.
Software Dependencies | No | We introduce DeepSpeed and FlashAttention (Dao et al., 2022) for training and inference.
Experiment Setup | Yes | Table 9: Video-text retrieval fine-tuning settings. (The table itself details the optimizer, learning rates, batch size, epochs, input frame count, maximum text length, drop path, and augmentation.)